PREFACE


Emphasis in this edition again is placed on the psychological 
foundations of tests and on the psychological aspects of interpreting and 
evaluating test findings, while the statistical aspects are not neglected. A 
good deal of illustrative test material, therefore, has been incorporated, 
as in the first two editions. 

This edition includes two major additions: a chapter on the historical 
background of psychological testing and one on elementary statistical 
concepts. I believe that students, especially graduate students, should have 
the perspective provided by an historical introduction to the subject. 
Better still, they should consult specialized books and original sources. 
That students should understand certain statistical concepts goes without 
saying. The chapter on statistics attempts to provide that understanding. 
Preferably, students should have a prior course in statistics, designed to 
give them insights through working on a large number of varied kinds 
of problems. For students who have had such a course, this chapter on 
statistical concepts can serve as a refresher; for others it should provide
the information necessary to understand the statistical terms and data
in the text.


The sections on reliability and validity have been considerably expanded;
materials throughout have been brought up to date; some tests
have been eliminated from discussion and others, which seem to be more 
useful in the presentation, have been added; much more attention has 
been given to multiple factor batteries. The fact that very few tests de- 
veloped in countries other than the United States are included is not an 
index of their quality or number. Space and redundancy were the
determinants.


Projective tests again are given an appreciable amount of space; for,
though opinions vary concerning their value, projective tests are of primary
importance in current psychological practice, research, thinking,
and controversy. Students of the general subject of psychological testing,
especially those whose major interests do not include clinical psychology, 
should be familiar with projectives as well as with other types of instru- 
ments. A bibliography has been appended to each chapter to provide 
additional sources of information about the tests and more readings for 
advanced students. 

I am indebted, of course, to the individuals and publishers who pro- 
vided materials and who readily granted permission to reproduce these. 
I am indebted, also, to several psychologists who contributed valuable 
suggestions for this revision; especially to Dr. Harold H. Abelson. 


F. S. F. 
Ithaca, New York 
1962 

PSYCHOLOGICAL 
TESTING 


I. 


HISTORICAL BACKGROUND 


Uses of Psychological Tests 


Psychological tests have been devised and are used primarily for 
the determination and analysis of individual differences in general intel- 
ligence, specific aptitudes, educational achievement, vocational fitness, 
and nonintellectual personality traits. Tests also have long been used 
for a variety of psychological, educational, cultural, sociological, and em- 
ployment studies of groups rather than for the study of a particular in- 
dividual. Among these studies of groups, the following have been most 
common and include the most important fields of investigation: the na- 
ture and course of mental development; intellectual and nonintellectual 
personality differences associated with age, sex, and racial membership; 
differences that might be attributed to hereditary or to environmental 
factors; differences among persons at different occupational levels and 
among their children; intellectual and other personality traits of atypical 
groups such as the mentally gifted, the mentally retarded, the neurotic,
and the psychotic.

Psychological tests, especially those of general intelligence and of specific
aptitudes, have had very extensive use in educational classification,
selection, and planning, from the first grade (and sometimes earlier) 
through the university. Prior to World War II, schools and colleges were 
the largest users of psychological tests. During and after World War II, 
however, so many types of tests were administered to so many men and
women in all branches of the military services that the armed forces,
along with educational institutions, must now be regarded as the major 
users of psychological devices. 

When tests are used for the determination and analysis of an individ- 
ual’s intellectual abilities or nonintellectual traits, the purpose might be 
to provide educational and vocational guidance; to place an individual 
in a special class for superior pupils or in one for the mentally retarded; 
to discern weaknesses in order to provide remedial instruction; or to 
discover causes, intellectual or otherwise, which might account for be- 
havior problems in school. 

In clinics, psychological tests are used primarily for individual diag- 
nosis of factors associated with personal problems of learning, behavior, 
attitudes, or specific interpersonal relations. 

In business and industry, tests are helpful in selecting and classifying 
personnel for placement in jobs that range from the simpler semiskilled 
to the highly skilled, from the selection of filing clerks and salespersons 
to top management. For any of these positions, however, test results are 
only one source of information—though an important one. 

The foregoing discussion emphasizes the fact that psychological tests 
and testing play a significant role in a wide variety of situations and can 
significantly affect the lives of many persons. But even though they are 
significant educational, vocational, and diagnostic assets today, psychological
tests did not begin to assume appreciable significance until about
1910-15.


The Nineteenth Century 


Although the fact that persons differ in intellectual and other 
psychological characteristics had been apparent to observers for many 
centuries, it was only about a hundred years ago that these differences 
were first studied scientifically and subjected to measurement and ob- 
jective evaluation. 

Francis Galton (1822-1911) was the first scientist to undertake systematic
and statistical investigations of individual differences. He was
preceded, before the middle of the nineteenth century, by other men
who are important in the history of psychology; but these men, who
belonged to one of two groups, were not concerned with devising means
of measuring individual differences. Some were nonexperimental, speculative
philosophers who were concerned largely with problems of
mind and matter, the nature of ideas, intellectual “faculties,” and
associationism. Others, though experimentally oriented,
directed their attention to general problems and theories rather than to
variations and differences in human abilities.

Among these was Ernst Heinrich Weber (1795-1878), educated as an 
anatomist and physiologist, who experimented on weight discrimination, 
vision, hearing, and the "two-point threshold" of the skin. He is best 
remembered for his quantitative experimental approach to psychological 
problems and for what we know as Weber's law.1 Gustav Theodor
Fechner (1801-87), who started his career in physics and chemistry, was 
basically concerned with the application of the exact methods of the 
natural sciences to the study of man's "inner world," that is, the relations 
of mental processes to physical phenomena. Johannes Müller (1801-58), 
a professor of physiology, was especially interested in the physiology of 
the senses and in reflex action. In his significant experiments in space 
perception, he attempted to reconcile the opposed theories of “nativism” 
versus “empiricism.” William Hamilton (1788-1856) and James Mill 
(1773-1836) were concerned with reformulating more completely and 
rigorously the classical association theory. 

One of the most significant writers in psychology at mid-nineteenth 
century was Alexander Bain (1818-1903), who was Professor of Logic, 
Mental Philosophy, and English Literature in Aberdeen University. His 
two most distinguished works were The Senses and the Intellect (1855) 
and The Emotions and the Will (1859). Bain's approach was principally
through physiology; he utilized, organized, and interpreted findings of the
German experimentalists in a systematic restatement of associationism.
Perhaps Bain's most important contribution was his pioneering effort
to contain the entire range of human experience within a system of
psychology.

Although Wundt's principal work was done somewhat later than Galton's,
he is significant not only for his actual contributions but also as
an example of nineteenth-century neglect of differential psychology.


Wundt (1832-1920), who established the first laboratory of psychology,
in 1879, at Leipzig University, employed physiological methods and
introspection in his and his students' research. He held that ". . . a
genuinely psychological experiment involved an objectively knowable
and preferably a measurable stimulus, applied under [specific] conditions,
resulting in a response objectively known and measured. But there were
certain intervening steps which [could be known only] through introspection,
sometimes supplemented by instrumentation" (31, p. 161). Thus,
Wundt's method emphasized the necessity of knowing and stating consciously
experienced events as they are related to objective and measurable
stimuli and responses. For Wundt, introspection became the most
important method of the experimental psychologist. These methods he
applied to experimental study of vision, hearing, reaction time, psychophysical
problems,2 and to the analysis of word associations. It is interesting
to note that one of Wundt's students from the United States, James
McKeen Cattell (1860-1944), was impressed by the range of individual
differences he found in his experiments. Although he was discouraged
by Wundt from pursuing the subject, he persisted in doing so for many
years.

1 This law states that the least added difference of a stimulus that can be noticed
is a constant proportional part of the original stimulus.

These several examples will suffice to indicate the major interests of 
nineteenth-century psychologists, from which the pioneers in psycho- 
logical testing had to break away. Yet the work of these early psychol- 
ogists did significantly influence the types of testing first used in experi- 
mental work on individual differences. 


Interest in the Mentally Deficient 


In France, during the first half of the nineteenth century, in- 
terest in more accurate differentiation among individuals with regard to 
mental abilities was stimulated by a number of men, of whom two of the 
outstanding will be mentioned: Jean Esquirol (1772-1840) and Édouard 
Seguin (1812-80). They were concerned with mental deficiency and mental 
disease (14, 97). 

Esquirol, in the first place, made explicit the distinction between
mental deficiency and mental illness. These abnormal conditions were
at that time generally undifferentiated and confused.3 He also distinguished
among the several levels of mental deficiency. Esquirol spoke of
the "weak-minded" and of several grades, or levels, of “idiocy”; the former
term he applied to what, for many years now, has been called “moronity”
(and probably also includes borderline cases), while the latter term refers
to the current terms "imbecility" and “idiocy.” These groups, however,
were not precisely defined or delineated, although Esquirol did attempt
to distinguish and classify mentally deficient individuals on
the basis of physical measurements, especially size and formation of the
skull. It remained for Binet and his collaborator, Simon, to devise the
first standard scale of intelligence and behavioral criteria that would
differentiate the three levels of mental deficiency: moron, imbecile, and
idiot.

2 Psychophysics is the study of the relation between the physical attributes of the
stimulus and the quantitative attributes of sensation.

3 Esquirol's distinction, the one that has been current since then, is now
widely understood: namely, that mental deficiency is a condition of seriously subnormal
mental development due to congenital causes or to accidental causes occurring during
early childhood; mental illness (psychosis) is a severe disorder which may be
marked by impairment of mental functions and behavior, and by personality
disintegration.

Esquirol did, however, correctly discern the fact that development and 
use of language is one of the most useful and valid psychological criteria 
for differentiating levels of mental deficiency. This observation is of 
historical interest because for many years now the development, use, 
organization, and interpretation of verbal materials have been regarded 
by numerous psychologists as one of the major aspects—in some instances, 
the major aspect—of mental ability. Especially noteworthy among these 
psychologists is the late Lewis M. Terman, about whose work much more 
will be said in subsequent chapters. 

Seguin is noteworthy for his pioneering work and methods in the train- 
ing of mental defectives. He was placed in charge of a school for this pur- 
pose in 1842, after having had his own small school for the training of 
mental defectives for five years. Seguin believed that with appropriate help 
these individuals could improve in behavior, in utilization of their limited 
mental capacity, in their economic adequacy, and in their personalities 
generally. In 1846, his book on the treatment of mental defectives ap- 
peared (37). Like Esquirol, he attempted to find a basis for distinguish- 
ing between idiocy and imbecility, and between these and “backward- 
ness." In 1848, Seguin migrated to the United States where, as in France, 
he stimulated interest in the study and training of mental defectives. 
His methods emphasized the development of greater sensory sensitivity 
and discrimination and of improved motor control and utilization.

Both Esquirol and Seguin are of significance to us because of their ef- 
forts to establish psychological criteria upon which to base differentiations 
among levels of mental deficiency; and, as will be seen later, it was this 
problem which provided the strongest original motive for the testing
movement after 1900. Seguin, furthermore, is known for his Form
Board, which carries his name and is part of several performance test
batteries currently in use.


Francis Galton's Contributions


It is clear from the foregoing brief account that until the last
quarter of the nineteenth century there was scant recognition of individual
differences as a subject worthy of study and research by psychologists.
This indifference, no doubt, retarded the development of psychological
tests that would be necessary for their measurement. Galton, though
interested in and influenced by the psychological work of his predecessors
and contemporaries, was even more strongly influenced by the development
of the biological sciences then ascendant among British scientists.
Consequently, his efforts were devoted largely to investigations of individ- 
ual differences more from biological interests than from psychological. 
In the introduction to his Inquiries into Human Faculty (1883), he 
states (18): 


My general object has been to take note of the varied hereditary faculties 
of different men, and of the great differences in different families and races, 
to learn how far history may have shown the practicability of supplanting 
inefficient human stock by better strains, and to consider whether it might 
not be our duty to do so by such efforts as may be reasonable, thus exerting 
ourselves to further the ends of evolution more rapidly and with less distress 
than if events were left to their own course. 


This quotation is evidence of Galton's sustained interest in developing 
a science of genetics and eugenics. It also indicates a problem with which 
psychologists have since been concerned—the roles of heredity and envi- 
ronment (or, as Galton named them, "nature and nurture") in the de- 
velopment of man's intelligence. For the study of this problem, objective 
psychological tests have been indispensable. 

Prior to the appearance of the volume mentioned above, Galton had 
published the results of his earlier studies in Hereditary Genius (1869), 
and English Men of Science: Their Nature and Nurture (1874). His In- 
quiries into Human Faculty was followed by Natural Inheritance (1889) 
and Noteworthy Families (1906), the last with Schuster. In addition to 
these larger works published during this period of about forty years, Gal- 
ton produced numerous articles on the general subjects indicated by the 
titles of his books. At the same time, his statistical techniques for the 
analysis of data provided the basis for the elaborated, extended, and re- 
fined statistical methods used by such men as Karl Pearson, British bio- 
metrician, and Charles Spearman, British psychologist, who was one of 
the earliest and most noteworthy men to engage in the analysis of human 
abilities (38). 


Galton not only stimulated investigations of individual differences;
he also influenced the direction of the experimental efforts to
measure them by means of tests of imagery and sensory discrimination.
He devised a test for the measurement of the delicacy of weight
discrimination and invented what is now known as the Galton whistle
for measuring sensitivity to high tones. In addition, he suggested devices
for testing visual and auditory discriminations, reaction time, and muscular
strength.

Galton assumed, apparently, that the simpler and measurable sensory
processes would be significantly correlated with intelligence. That this
was his hypothesis is shown by the fact that as subjects for study he
selected persons of extreme differences in mental ability in order to learn
whether their differences in sensory discrimination corresponded with the
known differences in their mental abilities. Although it has long since
been learned that sensory and sensory-motor tests have very little value
for the study of the higher and more complex processes called intelligence,
Galton's work, nevertheless, did strongly affect the course taken
by test experimenters until about 1900, when the influence of Alfred
Binet, the French psychologist, was felt.


Binet's Contributions 


It is impossible in a short space to present a full review and 
evaluation of the character, range, and importance of Alfred Binet’s 
contributions to individual psychology. An attempt will be made, how- 
ever, to indicate his supreme importance in the field of mental measure- 
ments and individual differences. 

Young (48) has quite properly said that "the contribution of Alfred
Binet stands supreme for its general originality and the fact that he
synthesized the growing movement into his now well-known scale." Binet
and his collaborators objected to the types of psychological testing which
followed Galton's work, on the ground that they were too simple in
nature and would contribute little to the understanding of differences
in the complex and higher mental processes; for it is in these higher
processes that individual differences are most marked, and it is these
which distinguish individuals most significantly and characteristically in
daily activity; whereas it is in the simpler sensory and motor processes
that persons differ least. Binet was quite ready to admit that the simpler
processes lent themselves to more precise measurement and, therefore,
yielded more nearly constant results. Yet his interests were strongest in
individuals rather than in the study of sensations or ideas. Thus he was
ready to sacrifice the greater quantitative precision of sensory-motor tests
in order to obtain a more nearly accurate study of the integrated mentality
of the individual. He argued that in the measurement of the higher
functions, the greatest precision, though desirable, was not as essential
as in measuring the simpler functions, because of the very fact that individual
differences are more marked in the former. Binet made it clear,
however, that his proposed scale would not measure in a physical sense,
in the same way, for example, that a line is measured. It would, however,
establish a hierarchy among diverse intelligences; and for practical
purposes "this classification is equivalent to a measure."
Binet and his collaborators were interested, consequently, in
establishing the extent and nature of variations of the mental processes
from one individual to another, and in the determination of the inter- 
relations of the various processes within the individual. Binet and Henri 
(a collaborator) proposed, therefore, to study the following functions: 
memory, the nature of mental images, imagination, attention, compre- 
hension, suggestibility, esthetic feeling or appreciation, moral sentiments, 


muscular strength and strength of will, motor skill, and visual
judgment. These are, they believed, "faculties" which differ much from
one individual to another and are such that knowledge of their state for an
individual gives us a general idea of this person and permits us to distinguish
him from other individuals within the same milieu. Here we
have the beginnings of the tests which a few years later proved so useful
in the construction of Binet's scales.

The range and number of publications coming from Binet and his collaborators
were remarkable (45). They, especially Binet, interested
themselves in and investigated an unusual variety of problems relevant to
individual psychology including such matters as handwriting, head
measurements, physical growth, physiognomy, and palmistry. Yet his
abiding interest was in the problems of measuring intelligence and in differentiating
between the mental level of one person and another.

In 1904 a practical situation arose in [...]. Admission was to be determined by a
psychological examination. Obviously, the first device
needed was an objective means of selecting those of subnormal mentality.
Subjective opinions [...] for not only was there disagreement
among “experts,” but serious injustices might result [...]
which school children's intellectual abilities might be measured, and with
which the normal might be distinguished from the subnormal. They de- 
voted their efforts to the evaluation and the quantitative determination of 
"general intelligence," that is, intellectual level, and to comparisons with 
normal children. They recognized that the determination of special apti- 
tudes was a matter for later investigation. In fact, that very problem has 
been studied rather intensively within more recent years.* 

Binet’s first scale (1905), which he himself tested in Paris, was also tried 
out by other psychologists in Europe. As a result of these trials and con- 
sequent suggestions and criticisms, a second and considerably revised scale 
was constructed and appeared in 1908. Again, other psychologists col- 
laborated by using this new scale in their own countries: Decroly and 
Degand in Belgium, Goddard in the United States, Bobertag in Germany, 
and Ferrari in Italy. Binet took account of their findings and criticisms. As 
a result of these and his own investigations, he published another revised 
scale in 1911. This was Binet’s final contribution to the field of mental 
testing, for he died the same year.5 

Binet, the synthesizer and the originator, provided the original major 
impetus to the study of individual differences by means of standardized 
tests. Since 1911, revisions and adaptations of his scale have been made in 
a number of countries. Most later developments have been expansions, 
modifications, and improved standardizations of the 1911 scale. Under- 
standably, the principal interest for some years following Binet was in 
the identification and classification of mentally defective individuals. 


Developments in the United States 


Early Experiments. One of the most important of the early 
American psychologists in the study of individual differences was James 
McKeen Cattell (1860-1944), a man much younger than Galton, but still 
his contemporary. “It was Cattell,” says Professor R. L. Thorndike, "[who] 
. . . was perhaps the first rebel from within the ranks of psychologists
. . . to set his face against the narrowness of the Wundtian School
where . . . individual diversities were hidden in averages, or even discarded
as erroneous. . . . Cattell was bold enough to declare, in reference
to reaction times, that . . . 'The individual difference is a matter of
special interest.' Wundt opposed any study of individual differences in
themselves" (48, p. 32).

The term "mental tests" was first employed by Cattell in a publication
of 1890 in which he described tests then being used in his laboratory in
the University of Pennsylvania (11). Cattell's tests were of memory, imagery,
keenness of eyesight and of hearing, afterimages, color vision, color
preferences, perception of pitch and of weight, perception of time intervals,
sensitivity to pain, rate of perception and of movement, accuracy of
hand movement, and reaction time.

* Binet and Simon excluded from consideration those persons who had suffered
mental disorganization; that is, the dements.

5 Binet's scales are examined in some detail in Chapter 8.

The last of these was the most important of his early contributions to 
differential psychology; for much of the subsequent interest in reaction- 
time experiments is attributable to Cattell's work (20). One of the most 
direct methods with which certain of the simpler mental processes, 
such as discrimination and choice, can be studied is the precise measure- 
ment of the time an individual requires to respond to a given stimulus 
or to perform a specified act, usually a very simple one. Although many 
experiments on reaction time followed those of Cattell, and although 
these have added considerable information about speed of response to 
some types of stimuli, they have not made significant contributions to 
our understanding of higher, complex mental processes; for reaction time, 
it has been found, has little or no value in estimating intellectual abilities. 

The time factor in itself is a relatively minor aspect of most mental 
tests, except in those devised specifically to measure speed of performance, 
usually in a restricted type of activity for a specified purpose. A good ex- 
ample is the rate at which one can discern likenesses and differences be- 
tween two sets of digits or letters of the alphabet—a form of clerical test. 
However, Cattell justified his tests of sensory discrimination, motor ac- 
tivity, and simple reactions on the ground that his purpose at that time 
was principally anthropometric; therefore, measurement of the senses 
properly belonged within the scope of his research (12). He and his
collaborators realized that the more complex mental processes should be 
measured; but they were also aware of the fact that much research and 
analysis had yet to be done before adequate mental tests could be devised 
for the measurement of these processes. 

Other investigators in this country and abroad were experimenting
with psychological tests, following very much the same paths as those of
Galton and Cattell. Jastrow tried out tests of touch and cutaneous sensitivity,
vision, memory, and reaction time (25). Gilbert used
measures of height, weight, and lung capacity; also tests of sensation,
rapidity of tapping, reaction time, memory, and suggestibility. Against
these he compared teachers' ratings of their pupils' mental abilities (22).
The study of individual differences by objective scientific
methods and through comprehensive research was emphasized as
early as 1895, when the American Psychological Association appointed a
committee of which Cattell was a member ". . . to consider the feasibility
of cooperation among the various psychological laboratories in the collection
of mental and physical statistics" (12). Also, in 1896 the American
Association for the Advancement of Science appointed a committee
". . . to organize an ethnographic survey of the white races in the United
States" (12). Cattell, who was also a member of this committee, stressed 
the importance of including psychological tests in the survey and of co- 
operating with the committee of the Psychological Association. 

The development of testing had assumed importance to educators at an 
early date. In 1899, President Harper of the University of Chicago 
". . . recommended that a special study be made of the college student's
character, intellectual capacity, and tastes by the questionnaire method" 
(10). Further, in 1909, a committee of the National Education Association 
presented a report regarding psychological tests for mentally deficient 
children (8). From the report, it appears that the tests were looked upon 
as applicable chiefly to the subnormal and to other exceptional children. 

Revisions of the Binet. The great development in testing and 
studying individual differences in the United States occurred after Binet's 
work was made known. Goddard was the first to revise the Binet scales 
for use in this country. In 1911, he published his standardization of 
Binet's 1908 revision, with which he had been acquainted since 1909 (23). 
At that time, he was director of the laboratory of psychology at the Vine- 
land (New Jersey) Training School for Feeble-minded Children. Thus, as 
in France under the guidance of Binet, the scale in this country was first 
used almost entirely for the study and selection of mentally deficient individuals.
The Binet was made a part of the routine procedure at Vineland,
and it was rapidly adopted for use by psychologists in other institutions.

Goddard and Kuhlmann, who in 1911 published a revision of Binet's
1908 scale, made the test known and were largely responsible for its early
spread among clinical psychologists (80). Lewis M. Terman, who had already
interested himself in psychological differences among individuals,
brought the scale before the schools of the country. In 1912, he published
a tentative revision of the Binet; in 1915, he completed this revision with
collaborators. In 1916, he published The Measurement of Intelligence,
which presented the scale in its revised form, its standardization and directions
for administering and scoring, as well as brief explanations of the
psychological justification for each part (41).

In 1937, a revised and much improved edition in two forms was published
in collaboration with Maud A. Merrill (42). Inevitably, of course,
another revision of the scale had to be prepared. This last edition appeared
in 1960, again under the coauthorship of Terman and Merrill,
although Terman had died in December, 1956 (48). The 1916 and 1937
editions have been widely used in clinics, schools, and other agencies; and
the 1960 edition, it is reasonable to assume, will also enjoy widespread
currency.
Group Tests. Shortly after 1916, the most significant occurrence in
psychological testing was the development of group tests. The Binet and
its several revisions are administered to each person singly, the length
of time required varying with the age, brightness, and responsiveness of
the individual being tested. As a result, it is costly in time and money
to test large numbers of persons one by one, and in some instances it is
impossible to do so. Therefore, if many people are to be tested at once,
as is the case in the schools and the armed forces, a group test will have
distinct advantages if it yields sufficiently accurate and dependable
results.

Psychologists had already begun to study, by group tests, some of the
mental processes required in school work. So it was not a very long or
entirely new step to try devising a single scale in which a variety of
items, testing several mental processes, would be combined. This tendency
in group testing received its greatest impetus in 1917 with the entrance
of the United States into World War I. At that time the government agreed
with the views of a group of psychologists that it would be desirable to
examine the newly drafted men to determine their general mental capacity
and vocational fitness by means of the best available psychological
methods. The need was a pressing one, and a group-testing method was
imperative. This army problem enlisted the interests and cooperation of
leading psychologists, some of whom had already made contributions to the
field of measurements, and some of whom were already experimenting with
group methods. Pooli


t 
in which g 


other areas, esp
"There are to
reliable (see Chapter 4) and have reasonably good validity (see Chapter 5),
whereas others do not withstand scrutiny and evaluation. 

Performance Tests. Not long after the introduction of the 
Stanford-Binet scale, its emphasis upon language was criticized by some 
psychologists and educators. It was maintained that this scale, valuable 
though it is, needed to be supplemented by tests which do not require 
ability to deal with words, numbers, and abstract concepts. Accordingly, 
“performance tests” were developed to meet this criticism and to provide
means of testing individuals with language handicaps, as well as the deaf, 
the blind, and others for whom an adequate rating could not be obtained 
with tests that depended largely on language, numbers, and abstractions. 

A performance test provides a perceptual situation in which the subject 
manipulates items such as form boards, blocks, pictures, and disassembled 
objects instead of reasoning with symbols. Some psychologists apply the 
term also to “pencil-and-paper” tests that utilize nonverbal materials such
as printed geometric forms, pictorial representations, printed cubes, sub- 
stituting digits for symbols, and the like. It seems preferable, however, to 
designate these simply as “nonverbal” tests because they do not involve 
actual manipulation of objects as do performance tests. Both types of test 
materials, performance and nonverbal pencil-and-paper, are now used ex- 
tensively. Some scales, such as the Arthur and the Pintner-Paterson, are 
built entirely of performance materials; other scales combine one or both 
with verbal materials. 

Aptitude Tests. Another type of instrument, the development of which
received impetus in World War I, is the aptitude test. Each of these,
unlike tests of general ability, is intended to measure an individual's
ability to perform a task of a limited or specific kind, for example,
clerical, mechanical, or musical aptitudes. Interest in and development
of aptitude testing may be ascribed to several causes: the army's need,
during World War I, to select men for tasks requiring specific skills; the
desire, in vocational guidance and personnel assignment, to find the right
person for a specific job; the opposition of some educators and
psychologists to what they called the “super-faculty” of general
intelligence; and the belief of some of them that only specific aptitudes,
such as mechanical and clerical, could be satisfactorily measured. As a
matter of fact, tests of general intelligence and those of specific
aptitudes do not and need not stand opposed; they are supplemental.

* Some psychologists prefer to avoid the use of the term “intelligence,”
and to speak, instead, of “general aptitude,” “general ability,”
“scholastic aptitude,” and the like. We shall continue to use the term
“intelligence” because: (1) it has a long and respectable history in
psychology; (2) many of the most important tests with which we shall be
concerned are called tests of intelligence; (3) we shall have to deal with
what psychologists have long called theories and definitions of
intelligence; and (4) because there seems to be no merit in substituting
the term “general ability” or “general aptitude” for “general
intelligence.” Furthermore, even those who would reject the term
“intelligence” must and do use the concept of “intelligence quotient.”

Aptitude tests have been developed to predict educability and per- 
formance in music and drawing, in mechanical and clerical occupations, 
in engineering, in medicine and law, and in other areas as well. Others in 
this category are intended to evaluate aptitudes for the study of specific 
types of subject matter, such as science, foreign languages, and mathe- 
matics. 
Occupational Interest Inventories. To supplement tests of ap- 
titude and those of intelligence, several self-answering occupational pref- 
erence questionnaires, or inventories, have been devised to provide in- 
formation regarding an individual's interests in a variety of activities; for 
these have been found to have some relevance to and predictive value
for certain broad vocational areas or for certain specific occupations. 
Tests of Educational Achievement. Closely associated with the 
testing of aptitudes is the measurement of educational achievement and 
the construction of objective measures for that purpose. These are not 
designed primarily for prediction; instead, they are intended to measure 
the individual's actual learning in educational subject matter after a
period of instruction. They have proved to be highly valuable in the
determination of individual difficulties in learning, in the discovery of
strong scholastic interests, in the discovery of special abilities or dis- 
abilities, and, in combination with other factors, in plotting the educa- 
tional career of the individual child. 
Educational achievement tests have other values as well: they provide 
objective measures of progress, as opposed to teachers’ ratings that may 
be too subjective; they permit intergroup comparisons based on a
reasonably objective determination; and they facilitate experimental
evaluation of varied teaching methods.
Test Batteries. During World War II many test "batteries" * were
constructed. Those that made use of specific aptitudes and subject-matter
knowledge, especially the former, were most important. Batteries were
devised for the selection and training of personnel in a great variety
[...] several branches of the armed forces: radio and [...] gunners,
flight engineers, and other [...] of these batteries in the armed forces
[...] use of similar tests for the selection and training [...]
occupations.

* A “battery” of tests is a group of tests used in combination for a
specified purpose.

Multifactor Tests. These, also called “differential aptitude” tests, are
relatively recent developments in psychological measurement and
evaluation. Interest in them has increased markedly since about 1945,
although research on the subject began as early as the 1920's, when T. L.
Kelley (28) and, later, L. L. Thurstone (44), published their work on
factorial analysis of human abilities.⁵ Factorial analysis provided the
statistical tools for the development of multifactor tests, which isolate
and measure relatively "pure" mental operations (factors) or
“constellations” of closely related factors, rather than general
intelligence or general ability. In other words, multifactor tests isolate
the elements that constitute mental operations. The psychological
principle upon which these instruments are based is the theory that the
factors, or elements, are relatively independent of one another; hence, it
was concluded that they should be measured independently.

Multifactor scales were expected to be especially valuable in educa- 
tional and vocational counseling because they consist of separate tests of 
numerical operations, space relations, form perception, name perception, 
verbal reasoning, rote memory associations, and others restricted in com- 
plexity and range of mental operations. Each factor, or test, is thought 
to have special educational and vocational relevance and predictive value 
in itself; and a combination of factors is thought to have predictive value 
for specific areas of learning or occupations. The use of multifactor scales, 
therefore, would yield a "profile"⁶ of scores for each of the several factors
or "constellation" of factors, rather than a general, over-all rating for the
entire scale, such as those derived from the Stanford-Binet, the Wechsler,
and numerous group tests. All of these will be described and evaluated in
subsequent chapters.


Personality Tests 


Efforts to evaluate and test nonintellectual traits of personality 
were apparent in the nineteenth century beginning with Galton in 1879 
(17) and followed by Pearson (35), who devised questionnaires and rating 
scales. During the last decade of that century and the first of the twentieth, 
word-association tests were tried out by Jung of Switzerland (26, 27) and 
Kent and Rosanoff in the United States (29) in an effort to expose some 
of the "deeper" personality traits and, if possible, to assist in differentiat- 
ing among the various mental disorders. Although word-association tests 
are still used today in psychological clinics and elsewhere in diagnosing 
personality traits, they are much less frequently employed than
inventories and projective techniques.

⁵ Spearman preceded both Kelley and Thurstone in making statistical
analyses of human abilities; but he is not associated with the multifactor
test movement.

⁶ A psychological "profile" is a chart representing an individual's score
or relative position in each of several types of performance, with
separate scores made comparable by statistical treatment.

With widespread use of individual tests of intelligence in schools,
clinics, and hospitals, it became increasingly clear that in some cases an 
individual's performance on a test, his successes and failures, and the 
content and quality of his responses, were not only evidence of intellectual 
functioning, but were also affected, in greater or lesser degree, by non- 
intellectual traits of personality. The recognition of this fact, in addition 
to the growing interest in the scientific and clinical study of personality 
per se, provided the stimulus for the development of the several varieties 
of personality tests. Personnel problems during World War I provided 
impetus for their growth as well.

Today the tests are used extensively for the analysis of desirable and
undesirable traits in a wide range of civilian and military occupations. In 
addition, psychologists employ personality tests in studies of differences 
between subgroups within the same general society and of differences be- 
tween various cultural, national, and racial groups. 

The large current crop of personality tests now available varies in 
quality from those that are poorly conceived, inadequately validated, and 
therefore useless, to those having considerable value in the hands of
competent psychologists.¹⁰

Rating Scales. The earliest device employed, the rating scale, is a means
of obtaining the judgments of a number of respondents with reference to a
limited number of traits of a given individual. Rating scales were tried
out and used during World War I, well before they were formalized and
scaled both by statistical methods and by psychological analysis of
personality and behavior traits relevant to specified situations.

Self-Rating Inventories. The first self-report, questionnaire type of
personality inventory is the Personal Data Sheet, devised by R. S.
Woodworth for use in World War I and published in 1919. Employed [...] was
to [...] men who would prove [...] because of undesirable personality
[...] items in the form [...] individual. The aim of [...] questionnaire
is to detect personality and behavioral symptoms that are regarded as
indicative of maladjustment. The [...] the Data Sheet took the place of an
individual interview. [...] indicated a sufficient number of undesirable
symptoms [...] individually. The types of questions asked and the aspects
of personality sampled were forerunners of many of those included, with
very little modification, in subsequent inventories.

¹⁰ Cf. O. K. Buros (9). In this volume, 145 personality tests of all types
are critically [...]

Since the appearance of the Woodworth Personal Data Sheet, dozens
of personality inventories, representing several different types, have been 
published. In general, the emphasis of the items—questions or state- 
ments—in these instruments is on what the individual respondent ac- 
tually does in various kinds of situations and on how he feels about what 
he does in these situations. Relatively few of these inventories, however, 
have survived scientific analysis and practical use. Until the early 1930's 
these were, however, the principal instruments used to evaluate person- 
ality traits in a systematic and scientific, or quasi-scientific, manner (40). 

Projective Tests. In the early 1930's a newer type of instrument 
became prominent in American psychology: the projective test of per- 
sonality. This instrument is much more subtle than the self-rating in- 
ventory; it presents more or less equivocal, undefined ("unstructured") 
stimulus situations, usually in the form of pictures, inkblots, or incom- 
plete sentences. Thus, the person being tested has a greater opportunity 
to impose upon the test his own private and particular personality traits 
than would be exposed by means of the questionnaire type of inventory. 

The best known of the projectives is the Rorschach Inkblot Test, first 
published in Switzerland in 1921, although not introduced into the 
United States until the early 1930's. Rorschach, a Swiss psychiatrist, began 
his experimentation with inkblots as a means of stimulating and testing 
imagination. In the course of his work (1911-21), he perceived the pos- 
sibilities inherent in the inkblot test as a device for differentiating among 
various kinds and traits of personalities. Although Rorschach's work on 
inkblots was the most extensive of any up to that time, he was not the
first investigator to discern the possibilities of inkblots in
psychological experimentation. As a matter of fact, these had been used
for some years in psychological laboratories to study fertility of
imagination and of invention ¹¹ (46, part II). Since the introduction of
what has come to be known as "the Rorschach," it has been extensively used
in private psychological practice, in clinics, and in hospitals for
diagnostic purposes; in business and industry for some types of personnel
selection; in researches in cultural anthropology; and in researches on
personality theory. Interest in and use of the Rorschach can be inferred
from the huge number of professional publications on the subject,¹² which
did not begin to appear in appreciable numbers until about 1935.

¹¹ Among those who early suggested the use of inkblots were Binet and
Henri, in [...]

¹² In O. K. Buros (9), 2297 publications are listed.

Another projective instrument of major importance is the Thematic
Apperception Test, introduced by H. A. Murray and C. D. Morgan in 1935.
This test consists of thirty rather ambiguous pictures, each on a
separate card, and one blank card. The person being examined is asked 
to make up a story of his own for each picture. The psychological prin- 
ciple involved is that in his stories the examinee will, probably un- 
wittingly, give expression to his needs, values, attitudes, and feelings about 
persons, situations, and the world around him, as well as to the pressures 
he is experiencing from sources outside of himself. This instrument, too, 
has been and is being widely used in a variety of psychological settings.
While the number of publications on the TAT, as it is professionally 
known, is not so great as that on the Rorschach, it has, nevertheless, been 
the subject of many studies and researches.¹³

Since the appearance of the Rorschach and the TAT, a variety of 
other projective devices and techniques have been made available. Some 
of these are special adaptations of the two foregoing tests; others offer 
rather different approaches for the same general purpose, that is, to elicit 
responses which will reveal aspects and traits of personality that inven- 
tories and rating scales are incapable of eliciting. Since 1945 and to the 
present time, projective tests have occupied a position of primary im- 
portance in practical applications and in research. 

The types of techniques for obtaining evaluations of aspects of per- 
sonality thus far mentioned do not exhaust the list. Among other and 
more tenuous kinds of procedures used are storytelling and story
completion, drawing and painting, and "situational tests," in which an
individual's behavior is observed and rated in a setting that simulates
reality (84). Contrived play activities, usually of one child who is being
observed, are used for two purposes: to permit the child to project some
of his inner traits and to serve as a form of psychotherapy. Sociometric
methods, whereby an individual's social currency or acceptability is
obtained from ratings made by his peers, are an adaptation and extension
of the older rating scale (83).

Although all of these procedures are used in their appropriate settings,
they are much less commonly employed in personality evaluations than are
self-rating inventories, the Rorschach, and the TAT, because, being
tenuous, they are not susceptible to standardization and objectification.
To be sure, personality inventories and the more widely used projective
[...] have their own problems in standardization. However, progress has
been and continues to be made with these and their development has [...]
far enough to provide sufficient common ground and research [...], so that
in the hands of qualified psychologists they are of [...]

¹³ In O. K. Buros, op. cit., 610 publications are reported.


The Present Situation


Psychological tests of intelligence, whether based upon the 
theory of "general ability" or upon one of relatively independent factors 
(or aptitudes), and tests of specific aptitudes and skills are now at a rea- 
sonably advanced stage of development. This is so because they have been 
in the process of evolution and improvement for many years, a
tremendous amount of research has been devoted to them by numerous
psychologists, and they have been used in a variety of practical situations 
where their validity could be evaluated. Another reason is the fact that 
determination of the mental functions, or operations to be tested, though 
not simple, has not been as difficult as the determination by testing of 
nonintellective traits of personality. 

Because "personality" is so all-inclusive a concept, and because its 
manifestations are often complex and covert, development and use of 
self-rating inventories and projective tests are as yet not on so secure a 
foundation as tests of mental abilities, of specific aptitudes and skills, and 
of educational achievement. In subsequent chapters, we shall discuss the 
principles upon which all of these types of tests are based, as well as 
their values and their limitations. 

The great variety of psychological tests in existence has already been 
mentioned. The numerous uses to which they are put and the important
part they may have in the determination of an individual's educational,
vocational, or general welfare have been indicated. It is essential,
therefore, that anyone who employs these tests in a professional capacity
should understand the basic psychological and statistical principles upon
which they rest. It is necessary that everyone (teachers, psychiatrists,
guidance counselors, personnel administrators) who interprets the results
of test findings should be familiar with their essential theory as well as
with the meanings of the technical terms.

Since the end of World War I, the use of psychological tests has
continuously increased, because they are needed and because they have
improved steadily. Education in the United States has become more nearly
universal; individuals of inferior and those of seriously deficient mental
abilities are being retained in public schools much longer than was the
case in earlier years. Thus, the range of intelligence found in schools
extends from the very low to the highest levels, making it essential that
each individual's educational potential and promise be known as
accurately as available psychological means permit. The general increase
in years of schooling, not to speak of the tremendous growth in numbers of
students, has extended to college and university, so that the importance
of knowledge of individual variations in mental ability at higher
educational levels has also grown.

Educational and vocational guidance, at all levels, have consequently 
assumed increasing significance. With the availability of standardized 
tests, even with their defects, guidance has been placed upon a more 
objective basis, instead of remaining a matter of subjective, perhaps even 
casual, advice.

For many years schools and, now more recently, colleges and universities
have been concerned with the learning difficulties of individuals. Are
these difficulties due to inferior general intelligence? Or are they due
to specific disabilities, as in reading or spelling? Or to defective
perception of spatial relations? Or perhaps to defects of the visual-motor
function? Is an individual's lack of aptitude in shopwork attributable to
inferior manual dexterity? Is the individual's learning impaired or
retarded by poor ability for recall of rote or meaningful materials
although his level of general ability might otherwise be adequate for
learning? Answers to these and other important educational problems have
been provided or at least facilitated by the use of psychological tests.¹⁴

The types and numbers of occupations have multiplied, and specializations
within the types themselves have increased. It is unnecessary to detail
the vocational changes and developments that have taken place with
technological and scientific developments, but it does seem necessary to
[...] psychological testing and vocational [...] same name are not
necessarily [...] knowledge, specialized functions, and interests [...]
there are various factors that combine in different ways to create not a
single, unitary aptitude called “mechanical”; but, rather, there are
several different as[...] though all have something in common.
“Engineering aptitude” is not a single, [...] differences in requirements
for [...]ency in civil, mechanical, electrical, and [...] mutually
exclusive. Nor is [...] The fact that each [...]

¹⁴ [...] tests are administered and interpreted [...] will be said on this
matter in subsequent [...]

practice in them. These professions include medicine, in which there have
been researches on desirable personality traits of medical students. Some 
engineering schools would like to identify those nonintellectual traits 
that distinguish the successful from the unsuccessful students of the pro- 
fession. Psychologists are desirous of determining personality character- 
istics of the more promising students of clinical psychology. Some re- 
ligious denominations require that candidates for admission to their 
theological schools take tests of nonintellectual personality traits as well as 
of mental ability. 

Finally, there is the whole area of “mental health,” to which so much
attention has been given since the termination of World War II. Schools 
and colleges are concerned over individuals who present more than or- 
dinary degrees of personality difficulties or of problems of behavior. 
Numerous bureaus of child guidance have been established within school 
systems; there are mental health clinics in many sections of the country; 
federal hospitals (for example, of the Veterans' Administration) have
psychological divisions, as do many state and some private hospitals. In all
of these settings, psychological testing of all types, especially involving 
nonintellectual personality traits, is one of the established practices. And 
it is not uncommon for private welfare agencies to have on their staffs 
psychologists whose work consists of psychological diagnosis by means of 
tests, or of the practice of psychotherapy, which is often based upon or 
facilitated by diagnostic testing, or of both. Also, many psychologists in
private practice make diagnostic testing a significant or a major part of
their work.

This brief account of the current role and extent of psychological
testing should be sufficient to emphasize the development of this branch
of psychology since its relatively modest beginnings, shortly after the turn
of the twentieth century, when the principal purpose of testing was the
identification and special schooling of mentally deficient children.


ELEMENTARY STATISTICAL
CONCEPTS 


Construction of psychological and educational tests requires the 
use of statistical methods, as does a sound evaluation of the results ob- 
tained with these devices. Those who administer tests and those who 
evaluate the findings must, therefore, understand the meanings of the 
basic statistical concepts. It is the purpose of this chapter to present these
concepts and their significance in the statistical treatment and
interpretation of test results.


Two Kinds of Statistics 


Statistical methods and indexes are of two kinds: the descriptive and the
sampling. The first are measures, or indexes, that show the principal
characteristics of a set of data. When it is found, for example, that
65 percent of an entering college freshman class complete their course of
study, this is a simple descriptive statistic for the group. When it is
shown that of those individuals in the lowest decile group (lowest 10
percent) on a scholastic aptitude test only 40 percent complete the course
of study, we have a somewhat more analytical, descriptive statistic. But
it is desirable, in fact essential, to know the relationship between ranks
on the test and achievement in college courses for the entire group. For
this purpose, several different statistical methods are used, notably
correlations. The coefficient of correlation, explained later in this
chapter, is a much more complex statistic than is a percent; but it is
still an index descriptive of the observed data for this one group.

* Details of computational methods will not be presented, since the
development of computational skills is not our purpose. For methods of
computation, the student should consult any of the standard textbooks on
statistics. Nor shall we be concerned with derivations of formulas.

These indexes do not necessarily hold for groups or individuals other 
than the ones actually studied. General predictions cannot be made from 
these data and their indexes; nor can they be applied to results that 
might be obtained from other groups, unless the statistical techniques 
of predicting from a sample are used. In that case, the [...] of
the groups from which, and for which, the predictions are being made
must be considered. That is to say, not only are sampling errors
calculated, but comparability of groups must be established.

Analysis of the sampling error of obtained data is the second type of
statistical method. The purpose of determining sampling errors is to
indicate the limits within which predictions can be made from one
particular group to another and comparable group, or to predict from [...]
to future performance of the one particular group. In other words, an
answer is sought to this question: “How dependable are these obtained
data for predictive purposes?” Analysis of sampling errors provides a
statistical method of finding an answer.
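
A minimal sketch of the idea of sampling limits follows. It assumes the familiar large-sample formula for the standard error of a proportion, which the text itself does not give. The class size of 400 and the function name `proportion_limits` are hypothetical; only the figure of 65 percent comes from the example above.

```python
import math


def proportion_limits(p, n, z=1.96):
    """Return (standard_error, lower, upper) for an observed proportion.

    p is the observed proportion (0 to 1), n the number of cases, and z the
    normal deviate for the chosen limits (1.96 for roughly 95 percent).
    This uses the large-sample approximation sqrt(p * (1 - p) / n).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    se = math.sqrt(p * (1.0 - p) / n)
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return se, lower, upper


# Hypothetical class of 400 freshmen, 65 percent of whom complete the course.
se, lower, upper = proportion_limits(0.65, 400)
print(f"standard error = {se:.4f}")
print(f"approximate limits: {lower:.3f} to {upper:.3f}")
```

With a smaller group the limits widen, which is one way of seeing why an index computed on a single class cannot simply be carried over to other groups.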


Descriptive Statistics 


The Normal Distribution Curve. This curve is of primary im- 
portance in the science of psychological measurement because it has been 
approximated so frequently by the scores of large unbiased samples of 
individuals on tests of intelligence, and on other psychological and
educational tests as well. A generalized form of the normal curve is shown
in Figure 2.1. The obvious properties of this curve (often called
“bell-shaped”) are that it has a single mode (high point), extends
symmetrically in both directions (theoretically, without limit), and
gradually approaches the base line as an asymptote. Interpreted in terms
of frequency of cases, or of scores, this simply means that the scores are
concentrated around the mid-point and gradually decrease in frequency as
the distance from the center increases.

Fig. 2.1. Relationship between sigma distances and area included between
ordinates in normal curve
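
The relationship between sigma distances and area under the normal curve can be checked numerically. The sketch below assumes a perfectly normal distribution and uses the error function from Python's standard library; the function names are illustrative only.

```python
import math


def normal_cdf(z):
    """Proportion of a normal distribution lying below z sigma units."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_between(z_low, z_high):
    """Proportion of cases between two ordinates given in sigma units."""
    return normal_cdf(z_high) - normal_cdf(z_low)


def area_within(k):
    """Proportion of cases within k sigma on either side of the mean."""
    return area_between(-k, k)


for k in (1, 2, 3):
    print(f"within {k} sigma of the mean: {100 * area_within(k):.2f} percent")
print(f"mean to +1 sigma: {100 * area_between(0, 1):.2f} percent")
```

Because the curve is symmetrical, the area from the mean to one sigma above it is half of the area within one sigma on both sides.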

As will be seen in later chapters, normality of distribution is used as 
one criterion of the validity (see Chapter 5) of many tests of intelligence, 
a practice that emerged from early findings that the scores of such tests
approximate the normal curve. What this curve signifies psychologically 
in respect to intelligence, for example, is that a large percentage of indi- 
viduals are located at, or close to, the center of the distribution, and that 
farther up or down in the distribution, there are, respectively, fewer and 
fewer superior or inferior persons. 

Skewed Distribution. If a set of scores or measures is distinctly 
asymmetrical—that is, when scores are relatively concentrated at one side 
or the other—there is a skewed distribution (Fig. 2.2). If such a curve 
were found for the scores of a representative group in the process of 
standardizing an intelligence test, it would indicate, depending upon 
the direction of skewness, that the test was too easy or too difficult for 
that group. In another instance, if a well-standardized test had been used, 
the skewed distribution might indicate that a selected group had been 
tested. The interpretation of a skewed distribution depends, therefore, 
(1) on the measure used and (2) on the group with whom it is used. 


Fig. 2.2. Negatively skewed (left) and positively skewed distributions
of test scores (horizontal axes: score on test; vertical axes: frequency)


The Frequency Distribution. The first step in ordering a set of data
is to arrange them in a frequency distribution, that is, to put the
scores into a systematic, condensed form that gives an over-all,
comprehensive view of the set (Table 2.1). A frequency distribution is a
table in which the range of values is divided into class intervals of
such a size as will make the table understandable, without distorting the
character of the data. After the distribution table has been made,
additional descriptive indexes, which will be discussed in this chapter,
are calculated and a graph is plotted.

Tables 2.1 and 2.2 are illustrations, respectively, of an approximately 
symmetrical and an asymmetrical distribution. These tables are repre- 
sented in Figures 2.3 and 2.4 (called frequency polygons). 


TABLE 2.1

FREQUENCY DISTRIBUTION OF TEST SCORES

Scores      Frequencies
143-147          3
138-142          5
133-137         21
128-132         35
123-127         52
118-122         37
113-117         22
108-112         18
103-107          6
 98-102          1
             N = 200
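The tallying step described above can be sketched in a few lines of Python. This is only an illustration: the raw scores are hypothetical (the text gives no raw data), and the function name and its parameters are assumptions introduced here, with the interval width of 5 chosen to match Table 2.1.

```python
from collections import Counter

def frequency_distribution(scores, lowest, width):
    """Tally integer scores into class intervals of equal width.

    Returns (lower, upper, frequency) rows, highest interval first.
    Intervals that contain no scores are simply omitted.
    """
    counts = Counter((s - lowest) // width for s in scores)
    rows = []
    for k in sorted(counts, reverse=True):
        lower = lowest + k * width
        rows.append((lower, lower + width - 1, counts[k]))
    return rows

# Hypothetical raw scores, for illustration only (not from the text).
raw_scores = [99, 104, 106, 110, 111, 115, 119, 121, 124, 124, 126, 131, 136, 144]
table = frequency_distribution(raw_scores, lowest=98, width=5)
for lower, upper, freq in table:
    print(f"{lower}-{upper}  {freq}")
```

A wider interval gives a shorter table but hides more detail; a narrower one does the reverse, which is the trade-off the definition of class intervals refers to.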

TABLE 2.2 


A SKEWED FREQUENCY DISTRIBUTION 
Scores Frequencies 


100-104 
95-99 
Fig. 2.3. Frequency distribution plotted from Table 2.1 (horizontal axis:
scores, 95 to 150)

Fig. 2.4. Frequency distribution plotted from Table 2.2
[...] measures of variability, or dispersion, of the [...] of central
tendency is a single value intended to [...] principal characteristics of
the group studied. But [...] (often referred to as an "average") has
[...] itself, because it is derived from individual scores [...]. To say,
for example, that the mean IQ (intelligence quotient) of [...] children
is 115, tells us little about the
characteristics of that group. It is essential, therefore, to accompany a
mean or median by one or more measures of dispersion, discussed later in 
this chapter. 

The mean is simply the total of all of the values in the distribution,
divided by the total frequency (the number of cases).2 The principal property
of the mean is that every score, whether large or small, contributes its 
proportionate share to the result. At times this is a defect of the index, 
because a relatively small number of extreme scores will have an undue 
effect upon the mean, and thereby give a misleading impression of the 
group as a whole. The mean is particularly valuable, however, in deriving 
additional indexes, since it is rigorously defined mathematically and 
usually has the smallest sampling error. 

THE MEDIAN. This is the middle value of the distribution, on each
side of which fifty percent of the total number of values are located.
Medians are of two kinds: counted and calculated. If the number of
scores is not too large and if they have been arranged in order of
increasing or decreasing size, finding the median is a simple matter of
counting to find the central value.3 If the number of cases is large, the
median can be calculated from the frequency distribution. Counting from
the top or bottom, we find the class interval in which the median case
(N/2) falls. The exact value of this case is estimated by assuming that
the scores within that class interval are evenly distributed. This method
of estimating the median results in a relatively small error.
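A minimal sketch of the calculated median follows, applied to the intervals of Table 2.1. Two points are assumptions made here rather than statements of the text: the frequency of the lowest interval is taken as 1 (the value needed to reach the stated N of 200), and each interval is treated as extending half a score unit below its lowest score, a common convention for whole-number scores.

```python
def grouped_median(intervals):
    """Estimate the median from (lowest_score, highest_score, frequency) rows.

    Scores inside the median interval are assumed to be evenly spread,
    as the text describes.
    """
    rows = sorted(intervals)            # count upward from the bottom
    n = sum(freq for _, _, freq in rows)
    half = n / 2                        # position of the median case (N/2)
    below = 0                           # cases below the current interval
    for lowest, highest, freq in rows:
        if below + freq >= half:
            width = highest - lowest + 1
            lower_limit = lowest - 0.5  # assumed exact lower limit
            return lower_limit + (half - below) / freq * width
        below += freq
    raise ValueError("no cases in the distribution")

# Table 2.1; the frequency of 1 for 98-102 is inferred from N = 200.
TABLE_2_1 = [
    (143, 147, 3), (138, 142, 5), (133, 137, 21), (128, 132, 35),
    (123, 127, 52), (118, 122, 37), (113, 117, 22), (108, 112, 18),
    (103, 107, 6), (98, 102, 1),
]
print(round(grouped_median(TABLE_2_1), 2))
```

Under these assumptions the median case falls in the 123-127 interval, the one with the greatest frequency.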

It is apparent that in deriving the median, every value, regardless of 
its size, is as important as any other value. It is, therefore, not affected by 
relative size or by extreme scores. This is its principal advantage, aside 
from the ease and simplicity of finding the counted median, and the 
relative ease of finding the calculated median. A secondary advantage is 
the fact that the meaning of the median is readily apparent.

THE MODE. This measure of central tendency simply indicates the
most common value in a given distribution. In a frequency distribution,
the mode is represented by the mid-point of the class interval having the
greatest frequency. While a representative sample of a population is
expected to yield only one mode in a set of measurements, it sometimes
happens that a distribution of scores shows two modes (Fig. 2.5). Usually
a bimodal curve is attributable to special characteristics of the group
measured. Note that in Figure 2.5, the mean score does not coincide with
either of the modes. In a normal, unimodal distribution, on the other
hand, the mean coincides with the mode. In a curve that approximates the
normal, the mean is close to the mode, as in Figure 2.3. The mode is of
little value in psychological testing because it is not used in
conjunction with other descriptive statistics.

Fig. 2.5. A bimodal distribution (horizontal axis: test scores, 0 to 50;
vertical axis: frequency)

2 [...] are taught generally in the fifth grade of elementary school.

3 If the number of cases is even, the two middle values are averaged.


COMPARISON OF MEAN, MEDIAN, AND MODE. The characteristics of the
data and the purpose of the statistical analysis determine which one of
these indexes will be used. Often, however, it is desirable to find two
measures of central tendency as an indication of the effect produced by
the distribution's skewness or other irregularity.5

The following two sets of scores illustrate conditions under which (1)
mean and median are nearly identical and (2) they are markedly different.
Consider the following nine IQs: 87, 88, 89, 90, 97, 98, 100, 110, 123.
The mean is 98; the counted median is 97. Thus, it makes little
difference whether one or the other is used to represent the group. By
contrast, now consider the following values: 72, 73, 74, 75, 76, 77, 79,
81, 140. Here the mean is 83, whereas the median is 76. The mean in this
instance is raised appreciably above the general run of scores. The
median is, therefore, preferable as an index of central tendency. The
mode, obviously, has no meaning in either instance. This illustration, of
course, has been contrived; but it illustrates what can, [...]

5 [...] of a "25-year class" of a large [...] that class includes a
relatively small number of men who hold positions [...] extraordinarily
high salaries. These raise the mean to a level that is not representative
of the class as a whole. The median would have been a preferable index.

[...] dispersion. As already stated, a measure of central tendency is
insufficient as a [...] comparable. To assume, for example, that two
groups of pupils are equivalent simply because both have mean IQs of 100
would be to overlook the possibility that one might have a range of
90-110, whereas the other's range might be 70-130. The educational and
psychological characteristics of each group, as a whole, will differ
significantly from those of the other. Figure 2.6 represents this
situation.
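Returning to the two contrived sets of nine scores used to compare the mean and the median, the figures quoted there can be checked with Python's standard `statistics` module. The variable names are introduced here for illustration only.

```python
from statistics import mean, median

# The two sets of nine scores quoted in the comparison of mean and median.
first_set = [87, 88, 89, 90, 97, 98, 100, 110, 123]
second_set = [72, 73, 74, 75, 76, 77, 79, 81, 140]

for label, scores in (("first set", first_set), ("second set", second_set)):
    print(label, "mean =", mean(scores), "median =", median(scores))
```

In the second set the single score of 140 pulls the mean seven points above the median, while the median itself is unaffected by how extreme that score is.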


Fig. 2.6. Normal curves having identical means but different variabilities

RANGE. This measure is the most obvious and readily observed index
of variability. It is the difference between the lowest and the highest
values in the set of scores. Although it is a useful index at times, its
serious disadvantage is that it is determined by only the two extreme
scores, even though most of the other scores in the distribution might
be far removed from the extremes, for example, IQs of 50, 90, 95, 100,
103, 105, 106, 107, 110, 112, 140. If, on the other hand, the values
varied continuously from 50-140, the range would be a more appropriate
measure.

MEAN DEVIATION. This index is derived by calculating the deviation
of each score from the mean, without regard to plus or minus signs, and
then finding the arithmetic mean of these deviations. All cases are thus
taken into account. A small proportion of extreme scores will not
appreciably distort the mean deviation; but it is subject to the same
limitations as the mean value of the distribution. It can be a useful
index, however, in two ways: (1) to compare two groups in regard to their
variability (for example, does one fifth grade class vary more in
arithmetical achievement than another?); (2) to find each individual's
relative deviation from the mean of his group, through dividing each
person's deviation by the mean deviation. In many situations, the mean
deviation will suffice as an index of dispersion, particularly where no
additional indexes are to be derived. It has the additional advantage of
being easily calculated and readily understood.
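The range and the mean deviation can be sketched as follows, using the eleven IQs cited in the discussion of the range. The function names are assumptions made for this illustration.

```python
def score_range(scores):
    """Difference between the highest and the lowest score."""
    return max(scores) - min(scores)

def mean_deviation(scores):
    """Average of the deviations from the mean, ignoring their signs."""
    m = sum(scores) / len(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

# The IQs cited in the text's discussion of the range.
iqs = [50, 90, 95, 100, 103, 105, 106, 107, 110, 112, 140]
print("range:", score_range(iqs))
print("mean deviation:", round(mean_deviation(iqs), 1))
```

The range here is fixed entirely by the scores of 50 and 140, whereas the mean deviation draws on all eleven cases, which is the contrast the two paragraphs describe.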
STANDARD DEVIATION. This index of dispersion is the one most widely
used because it is precisely determined mathematically, is a constant
value of a normal curve, and is used in calculating other indexes, such
as the coefficient of correlation, in transforming "raw scores" into
"scaled," or weighted, scores (see p. 129), and in deriving the deviation
intelligence quotient (see p. 139).

The standard deviation (SD) is computed by summing the squared
deviation of each measure from the mean, dividing by the number of
cases, and extracting the square root. The formula for the standard
deviation will help to clarify the process of calculation:

\[ SD = \sqrt{\frac{\sum (X - M)^2}{N}} \]

Σ is the symbol for summation; M represents the mean of the distribution;
X represents each individual score; N represents the number of
cases in the distribution.
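The computation just described translates directly into code. The sketch below follows the formula as stated (division by N, not by N - 1); the example scores are hypothetical and were chosen so that the result is a whole number.

```python
import math

def standard_deviation(scores):
    """SD as defined in the text: root of the mean squared deviation."""
    n = len(scores)
    m = sum(scores) / n                              # M, the mean
    squared_deviations = sum((x - m) ** 2 for x in scores)
    return math.sqrt(squared_deviations / n)

# Hypothetical scores: mean 5, squared deviations sum to 32, 32 / 8 = 4.
example = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(example))
```

Python's standard library offers the same calculation as `statistics.pstdev`; `statistics.stdev` divides by N - 1 instead and so gives a slightly larger value.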


When the scores in a distribution are normally distributed, the
proportions of cases to be found within the SD ranges are as shown in
Table 2.3 (see also Fig. 2.1).

TABLE 2.3

PROPORTION OF THE NORMAL DISTRIBUTION WITHIN
GIVEN SD DISTANCES FROM THE MEAN *

Number of SDs    Percent of cases
     1.0               68
     1.5               87
     2.0               95
     2.5               99
     3.0               99.7

* To the nearest whole number, excepting 3 SDs.
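The percentages in Table 2.3 follow from the mathematics of the normal curve and can be recomputed with the error function in Python's `math` module. The function name below is an assumption made for this sketch.

```python
import math

def percent_within(k):
    """Percent of a normal distribution lying within k SDs of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(f"{k:.1f} SD: {percent_within(k):.2f} percent")
```

Rounded as in the table's footnote, these values agree with the entries listed.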


[...]ing a normal or near-normal distribution, has a mean IQ of 100 and a
standard deviation of 15, then 68 percent of the [...] between 85 and
115; 95 percent between 70 and 130; and 99.7 percent between 55 and 145
[...] 1000 would be expected to fall [...]

[...] apparently, no deviation. [...] satisfactory approximation, [...]

[...]tion of a group, but also that when it is translated into percentages
it enables us to locate an individual relative to the whole group in
terms of percentage of scores he surpasses in that group. Thus, the
standard deviation ties in with percentile ratings, briefly defined later
in this chapter, and more fully discussed in Chapter 6.8

QUARTILE DEVIATION. This index of variability, known also as the
semi-interquartile range, is defined as half the spread of the middle 50
percent of the scores when these are arranged according to size or in a
frequency distribution. It is found by using the following expression:

\[ Q = \frac{Q_3 - Q_1}{2} \]

in which Q3 is the point (or score) that is 75 percent up, and Q1 is
25 percent up, from the bottom of the distribution. Thus, the highest
and the lowest quarters are cut off, leaving the middle 50 percent. This
simple device should be used only with the median (which can be
represented by Q2). While it has the advantages of being readily
understood and not dependent on the type of distribution (normal, skewed,
rectangular, etc.), it is not useful if statistical calculations are
needed beyond it and the median. It is frequently employed, however, in
representing various kinds of scores (intelligence test results, scores
on educational achievement tests, school marks) for small groups.9
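A brief sketch of the semi-interquartile range follows. It relies on `statistics.quantiles` from Python's standard library; the choice of the "inclusive" method is an assumption made here, since the text does not say how the quartile points are to be interpolated, and other conventions can give slightly different values for small groups.

```python
from statistics import quantiles

def quartile_deviation(scores):
    """Half the distance between the 75-percent and 25-percent points."""
    q1, _q2, q3 = quantiles(scores, n=4, method="inclusive")
    return (q3 - q1) / 2

# Hypothetical scores, for illustration only.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartile_deviation(scores))
```

Because only the quartile points enter the calculation, replacing the 9 by 90 in this list would leave the result unchanged.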

COMPARABILITY OF RATINGS. In addition to the several reasons already
given for the use of measures of variability, there is still another
very important one. Scores obtained on each of several different
psychological or educational tests are not necessarily comparable. One
test might have a maximum of 100, a mean of 75, and a standard deviation
of 10; another a maximum of 200, a mean of 125, and an SD of 20.
Obviously, identical scores on these two tests have different meanings.

For the most part, units of measurement in psychology and in education
are not uniform throughout the scale, unlike units of physical
measurement. Scores on psychological and educational tests, therefore,
are more significant in terms of the relative rank they indicate than in
terms of quantity.10


8 Standard scores, stanines, T-scores, and deviation intelligence
quotients are indexes based upon the standard deviation. They are
explained in Chapter 6.

9 Percentile and decile ranks, which belong to the same family of
indexes as quartiles, are explained in Chapter 6.

10 This principle is further explained in Chapter 6.

Measures of dispersion, especially the standard deviation, percentile
and decile ranks, and other derived indexes, are essential in comparing
an individual's scores on two or more tests. The obtained scores are
translated into one of the derived indexes, so that an individual's
relative performance on one may be compared with his performance on
another.
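One such derived index is the standard score, the distance of a raw score from the mean expressed in SD units. The sketch below applies it to the two tests described under comparability of ratings (means of 75 and 125, SDs of 10 and 20); the raw score of 85 and the function name are assumptions chosen for illustration.

```python
def standard_score(raw, mean, sd):
    """Deviation of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

# The same raw score of 85 on the two tests described in the text.
on_first_test = standard_score(85, mean=75, sd=10)
on_second_test = standard_score(85, mean=125, sd=20)
print(on_first_test, on_second_test)
```

The identical raw score lies one SD above the mean on the first test and two SDs below it on the second, which is why obtained scores are converted before they are compared.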
Correlation

Meaning. It is frequently necessary to determine the extent
of relationship that exists between sets of scores representing two or
more traits or abilities, or between sets of scores obtained for other
reasons. For this purpose, the statistical technique of correlation is
used. As will be seen in Chapters 4 and 5, the use of this technique is
basic in the determination of a test's soundness, represented in terms of
reliability and validity.

For various scientific and practical reasons, educators and psychologists
have wanted to know the extent of relationship between abilities in
different school subjects (for example, arithmetic and reading
comprehension), between ratings on a test of intelligence and course
averages, between intelligence ratings of siblings, and between
children's height and weight. Are high, mediocre, and low levels of
ability in arithmetic associated with corresponding levels in reading
comprehension or in other subjects of study? How well do
intelligence-test ratings predict quality of performance in school work
at any of the grade levels (particularly, now, at the college level)? Do
siblings tend to be markedly similar in respect to intelligence level and
other traits? How closely do the intelligence levels of children agree
with those of their parents? For finding answers to these and many other
questions, the correlational method has been universally employed,
together with other types of statistical analysis.

Statistically, correlation is defined as the degree to which the paired
scores of two (or more) sets of measures tend to vary together. The
measure of the degree of concomitance is expressed as a coefficient of
correlation that summarizes the relationship.

The Pearson product-moment coefficient, designated by the symbol r,
is the one most frequently used and is the one meant, unless otherwise
designated. This coefficient may be of any size from zero to +1.00 or to
-1.00. The sign of the coefficient [...] there are high, moderate, or low
c[...] negative. Thus +1.00 indicates a perfect positive and -1.00 a
perfect negative correlation [...] is low negative. [...] the variables,
[...] being measured [...] between the measured variables, the
coefficient will be negative.11 In the case of the Pearson coefficient,
another requirement is that the relationship be linear.


TABLE 2.4

PAIRED SCORES CORRELATING .995

Student    Test score    Freshman average
   A          192               93
   B          183               82
   C          181               80
   D          178               79
   E          176               75
   F          174               75
   G          173               74
   H          170               69
   I          165               64
   J          158               59


Two simple tables of paired scores will illustrate different degrees of
positive correlation. Note that in Table 2.4 the covariation of the paired
scores is very close; in fact, they are almost identical, yielding a
coefficient of .995. Figure 2.7, based on this table, shows the
relationship graphically. Each point, representing the position of one
pair of scores, is on, or very nearly on, the diagonal line. If all points
were located on the line, r would equal +1.00.
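The text does not print the computing formula for r. The sketch below uses the standard product-moment definition (the sum of the products of paired deviations, divided by the square root of the product of the two sums of squared deviations) and applies it to the ten pairs of Table 2.4 as transcribed here. The function and variable names are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment coefficient for two paired lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Table 2.4: test scores and freshman averages for students A through J.
test_scores = [192, 183, 181, 178, 176, 174, 173, 170, 165, 158]
freshman_averages = [93, 82, 80, 79, 75, 75, 74, 69, 64, 59]
print(round(pearson_r(test_scores, freshman_averages), 3))
```

With these figures the coefficient comes out at about .99, in line with the very close covariation the table is meant to show.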


TABLE 2.5

PAIRED SCORES CORRELATING .65

Pupil    Test X    Test Y
  A        89        38
  B        86        40
  C        83        46
  D        82        28
  E        79        40
  F        76        34
  G        75        30
  H        74        36
  I        72        28
  J        63        24


By contrast, Table 2.5 and Figure 2.8 represent a correlation coefficient
of .65. The difference between coefficients of .99 and .65 is graphically
emphasized by comparing the locations of the points. In Figure 2.8 they
are more widely scattered; there is much less agreement between the
relative sizes of the scores in each pair.


The four charts in Figure 2.9 illustrate distributions of paired scores
that would yield coefficients of four significantly different sizes:
+1.00; .60; .40; and .00. The noteworthy characteristic of these charts is
the extent to which the points (each representing one pair of scores)
cluster about the diagonal or deviate from it. In chart a, all points are
located in the squares (representing equivalent values of X and Y)
through which the diagonal passes, thus indicating that each score X in
any pair is of the same relative size as the other score Y in that pair.
By [...] coefficient decreases, [...] between the rela[...] less and less
[...] when r = .00 is reached, any score in one variable may occur with
any score in the other. In this case, there is no demonstrable or
necessary relationship between the two sets of measures of the traits or
abilities tested.12 Negative correlation coefficients of these same sizes
would [...] tend, in varying degrees, to be associated with scores below
the mean in the other.13

[Fig. 2.8: paired scores plotted with X (60 to 90) on the horizontal axis
and Y (25 to 50) on the vertical axis; caption fragment: "... as listed
in Tables ..."]

Error of Estimate. The fact that r varies from 0 to 1.00 and
looks like a percent has at times been erroneously interpreted to mean
that it represents percentage of accuracy in making predictions, that is,
the percentage of frequency with which paired scores are found to be
correspondingly high, medium, or low (with steps between these three
levels). That this interpretation is incorrect is readily apparent in the
several figures illustrating correlations, all of which, except the +1.00
coefficient, show some "error" of estimate or prediction, since not all
points of paired scores fall, in these instances, on the straight line.
It may be said, however, that r is related, but not proportional, to the
average amount of error made in predicting one set of measures from
another.

A difference between an observed value and a predicted value is an
"error of estimate." The standard deviation of these "errors" is known
as the standard error of estimate; it is regarded as a measure of the
accuracy of estimate. Generally speaking, the larger the r, the smaller
the standard error of estimate. The equation for this index is:

\[ SE_{(est)} = SD_y \sqrt{1 - r^2} \]

It is often more informative to know the standard error of estimate than
it is to know the r alone. For example, in Table 2.5 the SD of test Y is
3.3. The correlation is +.65. When these values are substituted in the
formula, the result is 2.5. This is interpreted to mean that in this
distribution 68 percent of the errors will be within 2.5 (1 SE); 95
percent will be within 5.0 (2 SE); and 99.7 percent will be within 7.5
(3 SE).14 Thus, the standard deviation of the distribution and its degree
of correlation with the second determine the size of the error of
estimate.

Although this statistical concept is not found in manuals of tests as
often as are correlation coefficients, it occurs often enough to make it
necessary that the student be familiar with its meaning.
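The substitution described above can be checked in a few lines. The sketch uses only the two values quoted in the text (an SD of 3.3 for test Y and a correlation of .65); the function name is an assumption made for illustration.

```python
import math

def standard_error_of_estimate(sd_y, r):
    """SD of the errors made in predicting Y from a correlated measure."""
    return sd_y * math.sqrt(1 - r ** 2)

se = standard_error_of_estimate(sd_y=3.3, r=0.65)
print(round(se, 1))

# The same SD with a higher correlation gives a smaller error of estimate.
print(round(standard_error_of_estimate(sd_y=3.3, r=0.90), 1))
```

Note that even a correlation of .65 reduces the error only to about three quarters of the SD of Y, which is why r should not be read as a percentage of accuracy.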
[...]ments. These measures consist of a value for each individual in a
selected group. The collected data are concerned only with this
particular group; they do not necessarily bear directly on the problem of
what the results might have been if other groups had been measured. This
kind of statistical study is often valuable, indeed essential, in the
study of a variety of [...] and psychological problems when one is
concerned only with a specific group. For example: What is the
intelligence level of all children entering the first grade in a certain
city in a particular year? What is the arithmetical ability of pupils
finishing the sixth grade in the schools of this city? What is the
correlation between their achievement-test scores in arithmetic and in
reading comprehension?

For the general purposes of research, however, data are collected and
analyzed in order to derive facts and principles that represent the
general population being studied. The results obtained [...], it is hoped,
will approximate the results [...] representative groups. But data found
for any [...] as an approximation to the facts, representative of [...],
which (obviously) cannot be measured in its entirety [...] types of
problems.

The data obtained in any investigation represent a sample drawn from
the total group that in statistical terms is called the population or the
[...] very broad, inclusive one, such as all [...] United States in a
given year. Or it [...]ed; for example, all entering first-grade children
from families having [...] year, living in cities of 1,000,000 [...]. The
first requirement in undertaking a study involving [...] facts [...]
principles applicable to a whole population [...] representative sample
of that population. There [...] sampling [...]: random (simple and
unrestricted) and stratified (involving random sampling within specified
groups).

RANDOM SAMPLING. This method is one whereby each member of the
population under consideration has the same chance of being selected for
study as any other. For example, it would require a most [...] procedure
to take a random sample [...] whole population of the United States. Each
name would have to be [...] on a separate card; then the cards would have
to be thoroughly and completely [...] before the ones representing the
sample were selected. If it could be assumed that the [...] listing of
names was not re[...] studied, the procedure could be shortened by
pulling every nth card from the filing case. Polling [...] corners or
using [...] school does not constitute random sampling. Whatever the
method adopted, it is not random unless the definition of "random
sampling" is satisfied.

The conditions of simple random sampling are not always fulfilled in 
psychological research because of difficulties inherent in dealing with 
large, diversified, and geographically scattered populations. When it is 
not feasible to obtain a satisfactory random sample, another method is 
available. 
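Where a complete listing of the population is at hand, simple random sampling can be sketched with Python's `random` module. The population below is a hypothetical list of identification numbers, and both function names are assumptions made for this illustration. The second function mirrors the "every nth card" shortcut, which is acceptable only if the order of the listing is unrelated to the trait being studied.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members so that every member has the same chance of selection."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return rng.sample(population, k)

def every_nth(population, n, start=0):
    """Systematic shortcut: take every nth member of the listing."""
    return population[start::n]

# Hypothetical population of 1,000 identification numbers.
population = list(range(1000))
sample = simple_random_sample(population, k=10, seed=1)
systematic = every_nth(population, n=100)
print(sample)
print(systematic)
```

`random.Random.sample` draws without replacement, so no member can appear twice in the sample.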

STRATIFIED SAMPLING. This method is the commonly used alternative 
to simple random sampling and is superior to it. The total "population" 
to be studied is divided into a number of nonoverlapping categories 
which, if taken together, include all persons. Each characteristic selected 
in "stratification" should be relevant to the variable to be tested. This 
first step is followed by picking cases at random from within each of the 
categories. The number from each category is in the same proportion 
to the total number selected as that entire category is to the total popula- 
tion (the universe being studied). More specifically, this is called “propor- 
tionate stratified sampling" and is most frequently used in educational 
and psychological work. For example, if the economic status and educa- 
tional levels of parents are related to the variable being studied, the 
stratified sample must include the correct proportions of children in the 
several economic levels, having correct proportions of parents at the 
different educational levels. The proportions would be determined by 
data in the national census. Within each subgroup, children would be 
drawn at random. To illustrate, all children in the communities selected 
for study whose parents are in the $4000-to-$5000 bracket and whose 
parents have both completed twelve years of schooling would have an 
equal chance of being included in the sample. The same principle holds 


for children in every subgroup designated as related to the variable. Note
that random sampling is used within each subgroup.
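Proportionate stratified sampling, as described above, can be sketched as follows. The three strata, their sizes, and the function name are hypothetical; in practice the proportions would come from census data, as the text notes.

```python
import random

def proportionate_stratified_sample(strata, total_k, seed=None):
    """Sample at random within each stratum, in proportion to its size.

    strata maps a category name to the list of its members.
    Rounding may make the total differ slightly from total_k when the
    proportions do not divide evenly.
    """
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        k = round(total_k * len(members) / population_size)
        sample[name] = rng.sample(members, k)   # random within the stratum
    return sample

# Hypothetical strata holding 60, 30, and 10 percent of the population.
strata = {
    "low": list(range(0, 600)),
    "middle": list(range(600, 900)),
    "high": list(range(900, 1000)),
}
sample = proportionate_stratified_sample(strata, total_k=50, seed=1)
print({name: len(members) for name, members in sample.items()})
```

Each stratum contributes the same share of the sample as it holds of the population, while chance alone decides which of its members are drawn.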


Sampling Errors.

An ideal situation is one in which the mean and the standard deviation
in a particular set of measures are known for an entire specified
population. The ideal, however, is never reached, except in instances
where the population (or universe) is a restricted and relatively small
one. An example would be the range, mean, and standard deviation of
annual income of all alumni of a given university who have been graduated
25 years or more. In lieu of the ideal condition, a number of different
sample groups are measured. The means of these groups will,
theoretically, be distributed in the form of a normal curve. It is
possible, then, to find the mean of the means and the standard deviation
of the means, and thereby to estimate the probabilities that the mean for
the whole population falls within one or more standard deviations in the
distribution of the obtained means of the sample group. Figures 2.10 and
2.11 illustrate how N, the number of cases in the sample, affects the
distribution of means; as the number in each group increases, the range
of the means decreases; and, therefore, the precision of the estimated
population mean is increased.

Fig. 2.10. Effect of sample size on sampling distribution of means
(curves shown: population distribution; distribution of means of small
samples; axes: value and frequency)

Fig. 2.11. Distribution of means compared with population distribution

An explanation of the procedures used in estimating a total population's
mean and SD from those obtained with a number of separate samples is not
within the scope of this book. Nor is the converse of this: the
estimation of whether the obtained mean and SD for a particular sample
represent a random sample (within acceptable limits of error) or whether
they represent a special group that differs in a systematic way from the
total population. Details of these statistical techniques are available
in textbooks on statistics. The purpose of this brief discussion is to
call attention to sampling errors, which are important in the evaluation
of the merits of psychological tests.
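The effect of sample size on the distribution of means can be demonstrated by simulation. Everything in the sketch below is an assumption made for illustration: an artificial population of 10,000 scores with a mean near 100 and an SD near 15, 500 samples of each size, and the function name.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, n_samples, rng):
    """SD of the means of repeated random samples of a given size."""
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]
    return statistics.pstdev(means)

rng = random.Random(0)   # fixed seed so that the run is repeatable
# Artificial population, roughly normal with mean 100 and SD 15.
population = [rng.gauss(100, 15) for _ in range(10_000)]

spread_small = spread_of_sample_means(population, 10, 500, rng)
spread_large = spread_of_sample_means(population, 100, 500, rng)
print("samples of 10: ", round(spread_small, 2))
print("samples of 100:", round(spread_large, 2))
```

The means of the larger samples cluster much more closely about the population mean, which is the relationship Figures 2.10 and 2.11 are described as showing.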

Sampling Errors of the Coefficient of Correlation. Since coefficients
of correlation are based upon measurements of population samples, they,
too, are subject to error. And, as in the case of means, the larger the
sample, properly selected, the smaller is the error likely to be. The
formula for the standard error of r follows.

\[ \text{Standard error of } r = \frac{1 - r_p^2}{\sqrt{N - 1}} \]

N represents the number of cases in the sample; r_p represents the value
of the coefficient for the entire population. Since that value is not
actually known, statisticians have devised a neat method to obviate this
difficulty. It is assumed that the population coefficient is zero but
that the coefficients actually found for individual samples would not be
exactly zero; instead, they would distribute themselves symmetrically on
both sides [...] would take the form of a normal [...]

15 [...] standard error is the standard deviation of the distribution of
coefficients. Hence, it is interpreted in the same way as the SD.

[...] chance sampling of the total population for which the true value is
zero? The standard error of the correlation answers these questions in
terms of the probabilities of obtaining an r of a given size, with a
given N. If the probability of finding a given coefficient merely by
chance is only 1 in 100, or at the one percent level (the technical
term), then it is relatively safe to conclude that the obtained r does
not represent a chance relationship; indeed, that it is highly
dependable. The following example illustrates the method.

Assume that a test of scholastic aptitude correlates +.40 with college
grades for a sample of 25 freshmen. Is this a mere chance result or is the
coefficient representative of the presence of some degree of correlation
between these variables for the whole student population? Using the
formula for the standard error of r (and remembering that r_p = .00) the
result is as follows: 1/√24 = .21. Thus, from the properties of the
normal curve, it can be predicted that 68 percent of the obtained
coefficients would be between +.21 and -.21; that is, within one standard
error of the mean coefficient of zero. Or, only 32 percent are expected
to be greater than these limits.

Although, as stated earlier, the one percent level is usually the desired 
level of significance in this kind of problem, the five percent level may 
be acceptable as adequate, depending upon the population being studied. 
Since five percent of the cases within the normal curve lie beyond 2-1.96 
standard deviations, we multiply .21 by 1.96, yielding .41. In this case, 
then, the five percent level is represented by correlations beyond the 
limits of +.41 and —.41. Thus, the hypothetical correlation of .40 could 
by chance slightly more than five percent of the 


be expected to occur 
much less significant than if it were at the one per- 


time. It is, obviously, 


cent level. 
If, however, the sample coefficient of .40 had been found for 101
students,* the standard error would be 1/√100 = .10. Applying the same
reasoning as before, the five percent level of significance would be
represented by coefficients beyond the limits of ±.196, while the one
percent level is represented by ±.26. The chances are, therefore, only 1
in 100 that a coefficient as large as .26 would occur by chance for this
group of 101 students. Since the correlation of .40 is four standard errors
removed from the mean of zero, for this sample group, the probability that
it would occur by chance is less than 1 in 1000. Thus, with a sample of
about 100, or more, the conclusion is justified that the obtained coefficient
indicates the existence of a significant correlation between the two
variables in the whole population being studied.

The effect of sample size upon significant correlations is illustrated in
Table 2.6, which lists the smallest significant coefficients (correct to two
decimal places) at the one percent level, for selected sample sizes.
Similar tables can be prepared, of course, for a study in which the five
percent level would be acceptable.

* The N of 101 was chosen, as is evident, in order to facilitate calculation.
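
The arithmetic of the 101-student example can be checked with a short computation. The sketch below is an illustration only, written in Python; the function names are chosen here for convenience and do not come from the text. It assumes, as the example does, that the standard error of a zero correlation is 1/√(N - 1), and it takes 1.96 and 2.58 as the normal-curve multipliers for the five and one percent levels.

```python
import math


def standard_error_of_zero_r(n):
    """Standard error of r when the true correlation is zero: 1 / sqrt(N - 1)."""
    return 1.0 / math.sqrt(n - 1)


def significance_limits(n):
    """Limits beyond which r is significant at the five and one percent levels."""
    se = standard_error_of_zero_r(n)
    return {"five_percent": 1.96 * se, "one_percent": 2.58 * se}


def chance_probability(r, n):
    """Two-tailed normal probability of a coefficient at least as large as |r|."""
    z = abs(r) / standard_error_of_zero_r(n)
    return math.erfc(z / math.sqrt(2.0))


limits = significance_limits(101)
print(round(limits["five_percent"], 3), round(limits["one_percent"], 3))
print(chance_probability(0.40, 101) < 0.001)
```

For N = 101 the limits come to about .196 and .258, and the chance probability of a coefficient as large as .40 falls below 1 in 1000.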


TABLE 2.6

SMALLEST SIGNIFICANT CORRELATION COEFFICIENTS
AT THE ONE PERCENT LEVEL OF SIGNIFICANCE

Sample size     Correlation coefficient

      50                 .36
     100                 .26
     200                 .18
     300                 .15
     400                 .13
     500                 .12
    1000                 .08
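
A table of this kind can be approximated by the same reasoning used for the 101-student example. The following Python sketch is a rough check rather than the method by which the table was prepared: it assumes the standard error 1/√(N - 1) and the normal-curve multiplier 2.58, and the function name is hypothetical. At N = 50 this approximation gives .37 rather than the tabled .36, presumably because the table rests on a procedure better suited to small samples.

```python
import math


def smallest_significant_r(n, multiplier=2.58):
    """Approximate smallest significant r for a sample of size n.

    Assumes the standard error of a zero correlation is 1 / sqrt(n - 1);
    multiplier=2.58 is the normal-curve value for the one percent level
    (use 1.96 for the five percent level).
    """
    return multiplier / math.sqrt(n - 1)


sample_sizes = [50, 100, 200, 300, 400, 500, 1000]
approx = {n: round(smallest_significant_r(n), 2) for n in sample_sizes}

for n in sample_sizes:
    print(f"{n:>5}  {approx[n]:.2f}")
```

Passing `multiplier=1.96` yields the corresponding approximate limits for the five percent level.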


This discussion of standard errors of correlation coefficients is intended 
to emphasize the fact that before the predictive value of a psychological 
test can be evaluated the data must demonstrate that the instrument is 
significantly correlated with the criterion variable. Since the standard 
error decreases steadily as the number of cases increases, it is necessary to
have a large number of individuals in the standardization group, espe- 
cially when obtained coefficients are between .40 and .60, as so often 
happens in psychological and educational research. 

The psychologist who is concerned primarily with individuals, rather 
than with group trends, must go beyond demonstrating statistical
significance. He must, by other means, evaluate the educational and
psychological implications of statistically significant coefficients.
This principle
will recur in several connections in later chapters.

Definition of Psychological Test


A dictionary definition of the verb "to test" states that it means
the subjection to conditions that show the real character of a person or
thing in a certain particular. It also has been stated that a test is a series
of questions or exercises or other means of measuring the skill, knowledge,
intelligence, or aptitude of an individual or group.

These definitions apply to all psychological tests; but psychological
tests are more than these definitions indicate. A dictionary of
psychological terms defines a psychological test thus: "A set of
standardized or controlled occasions for response presented to an individual
with design to elicit a representative sample of his behavior when meeting a
given kind of environmental demand . . . it is now common usage to include
as a test any set of situations or occasions that elicit a characteristic
way of acting, whether or not a task, and whether or not characteristic of
the individual's best performance. Thus even a self-inventory or an attitude
survey is called a test" (3).

A concise definition, then, is this: A psychological test is a standardized
instrument designed to measure objectively one or more aspects of a total
personality by means of samples of verbal or nonverbal responses, or by
means of other behaviors. The key words in this definition are standardized,
objectively, and samples. Their connotations and significance will be
elaborated in the course of this and following chapters.

Although tests of general intelligence, specific aptitudes, educational
achievement, and personality are designed for their own particular pur- 
poses, all of them have certain basic principles in common and have been 
constructed by certain common procedures. Nor are they mutually exclu- 
sive, for any combination of tests might be used in studying a specific indi- 
vidual or in attempting to solve a particular psychological problem, either 
practical or theoretical. But probably the most important use of tests has 
been their contribution to the analysis and description of an individual's 
characteristics; to the evaluation, prediction, and guidance of his educa- 
tion and behavior; and to sounder determination of his vocational prep- 
aration and selection. 


Objectivity 


The purpose of standardizing a test is to give it objectivity, that 
is, to devise an instrument that, so far as possible, will be free from subjec- 
tive (personal) judgments regarding the ability, skill, knowledge, trait, or 
potentiality to be measured and evaluated. There are several elements, or 
aspects, that make a test objective. These are:

Everyone who administers the test does so according to a uniform and
specified set of instructions.

The responses to test items are uniformly scored according to specific
answers, or specimen answers, provided in a manual.

Norms of performance are based upon a population sample that has been
scientifically selected for the purposes of the particular test.

The mental activities or the personality traits to be tested are defined and
specified, and the psychological rationale is given.2

The activities or traits to be tested have been selected on the basis of
analyses of the operations or behaviors to be evaluated, upon the views
of a number of experts, and upon information available from previous
research.

The content of the test under construction is subjected to analysis by means
of established techniques of test standardization.


Objectivity in Administering and Scoring. Each psychological
test is administered under a prescribed set of procedures. These
instructions prepare the respondent by means of introductory and explanatory
remarks. The phrasing of instructions for presentation of each part (or at
times, each item) is prescribed and time limits, if any, are set. Instructions
are provided as to when directions should or should not be repeated, when
encouragement should be offered by the examiner or silence maintained,
and when questions from the examinee should or should not be answered.
If practice exercises are used, these are given in the manual, and usually
in the body of the test itself. These prescriptions are intended to create
conditions that are as nearly uniform as may be achieved for all persons
taking the test. Thus, scores and ratings derived under these conditions
are not subject to the individual judgment of the test administrator.

1 In the case of projective tests, the responses are analyzed in accordance
with certain given principles. In some instances, scores are added to the
analyses; in others, they are not.

2 This is not entirely true of all tests. It is true, however, of the
sounder ones.

A key, provided with the test, is used to score the responses; or, as in 
the case of the Stanford-Binet and the Wechsler scales, the scoring criteria 
are defined, specified, and illustrated so that subjective judgments of indi- 
vidual examiners do not enter or are reduced to a minimum. By these 
means, an objective test provides uniformity in the scoring of responses 
as well as in methods of obtaining them, and results found by one com- 
petent examiner are comparable with those obtained by others. 

The most objective kind of scoring is that of group tests, graded by
hand with a stencil or by an electronic machine. When either of these
means of scoring is used, a response is either right or wrong. The hand
stencil can be used by a clerk who knows nothing about psychological
testing, but who must be able to count correctly. The scoring machine,
however, requires a skilled operator, but one who needs to know nothing
about testing either; nor need she be able to count, since the machine
performs that task.


While highly objective, neither of these methods is free of error. The
clerk, scoring with a hand stencil, might make some mistakes. A score
obtained with an electronic machine may be incorrect if the examinee did
not make the necessary marks (with an "electronic pencil") exactly right.
In a later chapter the adva[...] discerned by a psychological examiner,
especially when an individually administered [...]

intended for the first three grades; or for grades eight through twelve; or
for college freshmen; or for other grade ranges, depending on the school 
subject and the prescribed scope.

Tests of specific aptitudes, similarly, are designed for specified
populations. For example, one test of ability in art is designed for grades seven
and above; one test of mechanical aptitude is to be used for ages 8 to 21; 
a law aptitude test is standardized for college students and others regard- 
less of age, who are candidates for admission to a law school. 

Rating scales, personality inventories, and projective tests are likewise 
intended for use with a specified segment of the total population. They 
may be designed for a selected age group; for particular occupations; for 
given educational levels; for one sex of limited age range; for the diagnosis 
of clinical cases, or for use with nonclinical populations as well. 

In any event, whatever the traits or functions to be measured, what- 
ever the range of ages or school grades, and whether for clinical or non- 
clinical groups, the test must be standardized on a group that is a rep- 
resentative sample of the total population for which it is intended. Each 
test must be constructed by actually sampling the performance or re- 
sponses of an adequate group that is typical of its population. 

The nature and the comprehensiveness of the test under construction 
will determine what factors are important in making a population 
sampling. In any instance, the sample should yield unbiased data on the 
population it purports to represent; and the sample should be large 
enough to provide statistically valid results for the traits or functions 
being measured by the test. 

This means, of course, that the author of a test must decide at the 
outset with which group, with what segment of the population, his in- 
strument is to be used. Then he must standardize his test on a popula- 
tion sample that is stratified according to relevant factors; and within 
each stratum the selection of cases should be adequate in number and 
of correct proportion in the total (13). 

For example, if a psychologist is constructing a test of general in- 
telligence for American children in the primary grades, ranging in age 
from 5 to 9 years, he will have to incorporate the following factors in 
obtaining his standardization population: age, sex, geographic area, pa- 
rental occupation level, and type of community (urban, village, farm). 
The author of the test must decide, also, whether he will standardize his 
test entirely on a Caucasian population, or whether he will include non- 
Caucasian elements. If it is to be the latter, then the racial factor must be 
taken into account in obtaining the sample. 

Since individuals within a representative sample of any given age vary 


5o BASIC PRINCIPLES 


widely in respect to mental abilities, some will reach only the levels of 
younger age groups while others will attain the levels of older groups. 
Thus, to ascertain the developmental level of the retarded, it is necessary 
to extend downward the chronological age of the standardization sample; 
and, conversely, it is necessary to extend upward the age limit for the 
superior. The validity of the results obtained with any psychological test 
will depend, in part, upon the adequacy and representativeness of the 
standardization population. 

There are two major kinds of population sampling, as already ex- 
plained in Chapter 2, and precautions must be taken to avoid obtaining a 
biased sample. It has been demonstrated that simple random selection 
very often fails to yield a representative group, so this sampling pro- 
cedure should not be used for test standardization. On the other hand, 
stratified sampling is now commonly used in constructing psychological 
tests. The population to be tested is divided (ideally) into a number of 
nonoverlapping categories which, together, will represent the entire group. 
Then, as already explained in Chapter 2, individuals are selected at 
random within each category. The number drawn from each category 
must be of the same proportion in the total sample as that category is in 
the entire group under consideration.3

A two-category division, for example, male and female, would be the
simplest case. In the stratified sampling, the two sexes would be
represented in the same proportion as they occur in the whole group (the
universe) from which the sample is being drawn. Similarly, if a test is to be
constructed for a specified age range, the proportion of each category
(sex, age, school grade, socioeconomic status, locale) would have to be
determined for inclusion in the sampling. It is obvious that the problem
of stratified selection becomes more complex and more difficult as the
number of categories is increased. There are times, therefore, when
authors of tests must be satisfied with close approximations to existing
proportions.4
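
The procedure of proportional stratified selection can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not a procedure taken from any test manual: the function names, the two-category universe of 600 girls and 400 boys, and the sample size of 50 are all hypothetical. Quotas are set in proportion to each category's share of the universe, rounding is settled by a largest-remainder rule (one way of accepting "close approximations"), and cases are then drawn at random within each category.

```python
import random
from collections import defaultdict


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Largest-remainder rounding makes the quotas sum exactly to sample_size.
    """
    total = sum(stratum_sizes.values())
    exact = {k: sample_size * v / total for k, v in stratum_sizes.items()}
    quotas = {k: int(x) for k, x in exact.items()}
    leftover = sample_size - sum(quotas.values())
    # Give the remaining places to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quotas[k], reverse=True)
    for k in by_remainder[:leftover]:
        quotas[k] += 1
    return quotas


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a proportional stratified random sample from population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)
    sizes = {k: len(members) for k, members in strata.items()}
    quotas = proportional_allocation(sizes, sample_size)
    sample = []
    for k, members in strata.items():
        # Random selection within each category, as the text prescribes.
        sample.extend(rng.sample(members, quotas[k]))
    return sample


# Hypothetical universe: 600 girls and 400 boys, identified only by an index.
population = [("F", i) for i in range(600)] + [("M", i) for i in range(400)]
sample = stratified_sample(population, stratum_of=lambda p: p[0], sample_size=50)
counts = {s: sum(1 for p in sample if p[0] == s) for s in ("F", "M")}
print(counts)
```

With more categories (age by sex by locale, for instance) the same allocation applies to each cell, which is where exact proportions become harder to achieve.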

3 Although the term [...]priate. Strata are layers, usually in a horizontal
or near-horizontal plane, one superimposed upon another. It is obvious that
the categories into which a [...] is divided frequently fail to satisfy [...]

4 Throughout [...] proportional representation has been [...] of the
estimated mean of a total [...] "disproportional sampling" [...]
separately; then their re[...] to represent the population as a whole.
[...] Statistical Methods in Market Research. [...] 1949, Ch. 6; also,
P. J. McCarthy, Introduction to Statistical Reasoning. New York:
McGraw-Hill Book Company, Inc., 1957.

Sampling of Traits and Functions


Any given test measures a limited aspect of the person being 
examined, although some tests are much more restricted in scope than 
others. It is essential, therefore, that the test builder define the aspects he 
proposes to measure. After doing this, he must develop a series of test 
items that will best sample the traits or functions with which his test is 
concerned. 

In developing a psychological test, it is impossible, and in fact un- 
necessary, to use an unlimited number of items. It is not necessary to 
attempt to present the individual being tested (called the “subject” or 
the "testee") with problems that will measure his responses for every 
conceivable situation involving a given trait or function. It is sufficient 
to get an adequate sampling of responses in a particular area or range of 
behavior, the assumption being that the sampling is representative of 
the whole. 

Two kinds of sampling are actually involved in constructing a psy- 
chological test. First, the most relevant constituents of the gross variable 
(the broad, comprehensive trait or function) must be selected. Where, 
for example, the gross variable is general intelligence, the constituent 
parts in the test might be: vocabulary, verbal comprehension, arithmetical 
problems, reasoning with practical problems, verbal and other analogies, 
perceptual organization, and so forth. Second, the operational levels (that 
is, the actual items) must be selected: which arithmetical processes and at 
what levels, what kinds and which levels of words, what types and ranges 
of situations, which perceptual figures? 

In following this procedure, psychologists are employing a well-known 
and widespread technique. If a chemist wishes to determine the quality 
of a shipment of milk, he takes small quantities here and there, combines 
these, and then analyzes a sample of the samples. If an agronomist wishes 
to analyze a given area of soil, he gathers small amounts from various 
spots. If a blood test is to be made, a very small quantity taken from one 
place is sufficient and representative of the entire stream. Numerous other 
illustrations can be found. So, too, with intelligence, specific aptitudes,
personality traits, and school achievement. It has been said that
psychological testing may be thought of, figuratively, as sinking shafts here and
there within a given range in order to measure depth and evaluate
quality.

Intelligence Tests. Specifically, for present purposes of illustration,
intelligence may be defined in several ways: (1) capacity to integrate
experiences and to meet a new situation by means of [...] adaptive
responses; (2) capacity to learn; (3) capacity to carry on abstract
thinking. Although psychologists differ in regard to which of these three
aspects is the most important and which they would emphasize, the fact
is that most tests of general intelligence probe and sample all three. The
following types of items found in various current tests fall under one or
more of these definitions; are constituent parts of the gross variable,
general intelligence; and illustrate the operational aspect.

Practical reasoning: "What's the thing for you to do when you have broken
something which belongs to someone else?" (From the Stanford-Binet scale)

Definitions of words: that is, concept formation

Perceiving similarities and differences between objects. For example: "In
what way are wood and coal alike?" "In what way are a baseball and an
orange alike and in what way are they different?" (From the Stanford-Binet
scale); that is, abstraction and generalization

General information tests: assimilation and retention of experiences

Arithmetical reasoning: reasoning with abstractions

Supplying missing parts to pictures: perceptual analysis and integration

Reproducing geometric figures from memory: visual imagery and organization

Arranging a series of pictures in logical sequence: visual perception and
reasoning

Perception of form design: visual imagery and recall, perceptual analysis
and organization

Explanation of absurdities in given pictures: logical analysis of visual
percepts

Oral solution of practical problems orally presented: analysis and
generalization

Solving problems involving distances and directions (without use of paper
and pencil): spatial orientation

Deriving and giving the meanings from a prose passage: reasoning with
abstractions
[...]

2. Vocabulary (word meaning)

3. Space perception (perceiving similarities of, and differences among
geometric figures)

4. Word fluency (controlled word association)

5. Reasoning (insight into patterns of letters arranged in series)

6. Memory (immediate recall of discrete verbal materials)

Tests have been constructed on the basis of this analysis, items having
been devised for each of the six categories. 

Still another approach to analysis of test content is known as construct 
validity (see Chapter 5). This term means that a test measures what it 
claims to measure if the mental processes (activities) required by the test 
items sample well the concepts, or constructs, that the test is designed to 
measure. Obviously, when a test's author uses this approach, he must be- 
gin by selecting and defining the concepts to be included. To do this 
adequately requires insights into psychological operations and knowledge 
of the mental activities involved in the situations in which the test is to be 
used. For example, in one series of group tests of intelligence, using the 
principle of construct validity, intelligence is analyzed and the test items 
are based upon the following concepts (8): 


The tasks deal with abstract and general concepts.

In most cases, the tasks require the interpretation of symbols.

In large part, it is the relationships among concepts and symbols with
which the examinee must deal.

The tasks require the examinee to be flexible in his basis for organizing
concepts and symbols.

Experience must be used in new patterns.

Power in working with abstract materials is emphasized rather than speed.


These concepts are then given form by determining which types or 
categories of materials should be employed. The authors of these tests 
decided on a nonverbal battery including figure classification, number 
series, and figure analogies; and on a verbal battery including sentence 
completion, verbal classification, arithmetical reasoning, and vocabulary.5 

The next task, a difficult one, is the preparation and experimental selec- 
tion of numerous individual items to give substance to each of these 
categories. This last task constitutes the technical aspects of test stand- 
ardization, discussed in the next chapter. 

The outline that follows is a further illustration of the trend among 
psychologists toward analysis of a gross variable into its component parts. 
In this instance, the investigator has set himself the task of testing
reasoning (4).

Reasoning I
  a. manipulating symbols
  b. solving problems
  c. defining problems
  d. testing hypotheses

Reasoning II
  a. seeing rules or principles (induction)
  b. seeing systems
  c. seeing trends
  d. seeing relations (educing relations)
  e. seeing identity of relationships
  f. analyzing forms

Reasoning III
  a. seeing common elements or properties
  b. classifying (in general)
  c. classifying forms
  d. educing correlates

Reasoning IV
  a. drawing inferences (deduction)
  b. syllogistic reasoning

5 These and other types of materials are illustrated in subsequent chapters.

Inspection of these four types of reasoning shows they are not independ- 
ent of one another. Yet if it can be demonstrated that these types are 
sufficiently distinct and constitute reasoning in its several aspects, and if 
reasoning were to be measured according to this scheme, it would be 
necessary to devise items for each of the four types and for each subtype. 
Ultimately, too, research would have to demonstrate the validity of these 
types as tests of reasoning when compared with one or more external 
criteria.6

6 This would [...] validity (see Chapter 5).

Specific Aptitudes. A specific aptitude test indicates the probable
degree of successful learning and achievement in a particular and
limited type of activity; for example, musical, graphic arts, mechanical,
clerical, linguistic. A test intended to estimate a person's capacity in any
specific area must include parts (called subtests) and items, sufficient in
number and extensive enough in scope and level of difficulty to provide
an adequate sampling upon which a prediction of subsequent learning
can be based. Without comment on their merits at this [...] clerical, and
mechanical areas [...]

aspects might be included in a test: knowledge of tools, knowledge of
mechanical devices, manual dexterity, perception of spatial relations, 
knowledge of terms (mechanical vocabulary), rate and accuracy of tap- 
ping, mechanical comprehension, and reasoning. The "MacQuarrie Test 
of Mechanical Ability" includes several aspects: tracing (drawing a line 
through a series of broken lines), rate of tapping within a series of small 
circles, dotting (the same as tapping, but the circles are much smaller and 
irregularly spaced), copying simple jagged lines on a square of dots, loca- 
tion (matching parts of a large square with corresponding parts of a much 
smaller square), block analysis and block counting, and pursuit (following 
each of ten intertwined irregular lines). The first three parts are intended 
to measure rate and accuracy of eye-hand coordination, while the remain- 
ing four are devised to measure spatial perception (9). These tests do not 
make any demands upon any form of mechanical comprehension, in- 
formation, or skill. It is MacQuarrie's view, apparently, that eye-hand 
coordination and spatial perception are basic and most significant in 
learning and functioning in certain mechanical occupations. 

The tests of mechanical comprehension developed in the United States 
Air Force during World War II were of a different kind. These included:
mechanical principles, requiring understanding of basic principles funda- 
mental to the solution of mechanical problems; mechanical functions, 
requiring knowledge of tools and instruments and comprehension of 
the methods of machine operations; and mechanical movements, requir- 
ing ability to comprehend and follow the operation of moving parts of 
machines (5). 

Personality Inventories. These tests must also be based upon 
samplings of the traits that the test's author proposes to evaluate. This is 
true even though so comprehensive a concept as "personality" is difficult 
to define, and even though nonintellective traits of personality are at 
times elusive. There are inventories, answered by the individual him- 
self, that are intended to evaluate degrees of introversion-extroversion, 
neurotic tendencies, security-insecurity, anxiety, hypochondriasis, dom- 
inance-submission, adjustment to home and to school, and others. In 
each instance, the author of the test must define the area and traits and 
then provide questions or statements that represent the manifestations or
symptoms of the trait. Some inventories are devised to evaluate a single
bipolar characteristic (frequently called a "dimension"), such as
security-insecurity (for example, the Maslow Inventory); others are
multiple-trait inventories (for example, the Bernreuter).

The Maslow Security-Insecurity Inventory provides a useful illustra- 
tion of the definition of a trait in terms of its syndrome; that is, the 
aspects that combine in one degree or another to create this bipolar
trait. Maslow lists fourteen aspects of security, and the opposite pole of
these fourteen as aspects of insecurity. The first three in each list are re- 
garded as relatively prior, or causal, whereas the remaining eleven are 
consequent, or are effects produced in the course of an individual's de- 
velopment. The three causal aspects, at each pole, are as follows (10). 


Security

1. Feeling of being liked or loved, of being accepted, of being looked
upon with warmth.

2. Feeling of belonging, of being at home in the world, of having a
place in the group.

3. Feeling of safety, rare feelings of threat and danger; unanxious.

Insecurity

1. Feeling of rejection, of being unloved, of being treated coldly and
without affection, or of being hated, of being despised.

2. Feelings of isolation, ostracism, aloneness or being out of it; feelings
of "uniqueness."

3. Constant feelings of threat and danger; anxiety.


The Bernreuter Personality Inventory is designed to evaluate degrees of
six traits: neurotic tendency, self-sufficiency, introversion-extroversion,
dominance-submission, confidence in one's self, and sociability (1). Each
of these has to be defined and described. The rating on self-sufficiency, for
[...] the extent to which the individual pre[...] sympathy or
encouragement, and tends to ignore [...] to introversion-extroversion, those
scoring [...] are said to be imaginative and to live within [...] low
(extroverts) are said to worry rarely, to suffer seldom from emotional
upsets, and rarely to substitute daydreaming for action.

The several types of projective tests of personality [...] different
origins. Rorschach, after whom the ink[...]gin with a set of personality
traits, arrived at [...] which he wanted to evalua[...]

The Murray Thematic Apperception Test originated in a different
manner (11). (See Chapter 25.) It derives from and is based upon the 
general theory of personality developed by H. A. Murray and his col- 
leagues at the Harvard Psychological Clinic in studying "normal" per- 
sons. This projective test rests upon analyses of human needs and of the 
environmental forces (called "press") affecting human behavior. Murray 
lists twenty-six needs and sixteen environmental forces to be elicited by 
the series of ambiguous pictures, drawn for this purpose, constituting the 
Thematic Apperception Test. Among these needs are the following, cited
for illustrative purposes (12):


Achievement: To work at something important with energy and per- 
sistence. To strive to accomplish something creditable. To get ahead in 
business, to persuade or lead a group, to create something. Ambition 
manifested in action. 

Dominance: To try to influence the behavior, sentiments, or ideas of 
others. To work for an executive position. To lead, manage, govern. To 
coerce, restrain, imprison. 

Intragression: To blame, criticize, reprove, or belittle [one's self] for 
wrongdoing, stupidity, or failure. To suffer feelings of inferiority, guilt, 
remorse. To punish [one's self] physically. To commit suicide. 


Among the "press" listed, and for whose presence the individual's test re- 
sponses are analyzed, are these: 


Affiliation (emotional): A person (parent, relative, lover) is affectionately 
devoted to the hero.7

Dominance (coercion): Someone tries to force the hero to do something. 
He is exposed to commands, orders, or forceful arguments. 

Rejection: A person rejects, scorns, repudiates, refuses to help, leaves, or 
is indifferent to the hero. A loved object is unfaithful. The hero is un- 
popular or not accepted for a position. He is fired from his job. 


At this stage, it is not our purpose to evaluate the abilities, aptitudes, 
and traits being measured or sampled by means of the tests. Our purpose 
is to indicate the several courses followed by psychologists in deciding 
upon the content of their tests. 

Educational Achievement Tests. A test in this category is designed
to measure an individual's understanding of, skill with, or information
in a given subject of study, or all three. Such tests may and do encompass
almost the whole range of subjects taught in elementary and secondary
schools, and some taught at the college level as well. In each instance,
as in all other types, the scope of the test must be defined, the divisions
of the subject must be determined, and the elements of each division must
be adequately represented in the final version of the test; for its validity
will depend upon the adequacy with which it samples its subject-matter
field.

7 The hero in each story is the central figure about whom the story is told
in response to each picture. The theory is that the respondent identifies
with the hero and through him reveals his own needs and the environmental
forces affecting him.

Tests of educational achievement are based primarily upon content
validity (see Chapter 20). This term means that the effectiveness and value 
of a test—the extent to which it measures what it claims to measure— 
depend upon the comprehensiveness and soundness of the materials it 
includes. In this connection, the question to be answered is this: Are the 
areas covered and the choice of topics based upon a sound and expert 
analysis of what should be included? A satisfactory answer to this
question will depend upon the expertness of the persons constructing the
test.


In the first place, an educational achievement test must be based upon 
the stated objectives of instruction in that subject of study. For example, 
here are the objectives in the first course in secondary school algebra, as 
defined in one state. If these objectives are common throughout the 
United States, a test based upon them could be widely used. Otherwise, 
the test could be appropriately used only in schools that share the same
objectives (16).


1. Acquisition of the basic vocabulary
2. Learning to translate quantitative statements into the language of
   algebra
3. Interpreting the solution of equations where they have significance
   in using rules of equality and transformation
4. Solving general verbal problems using as a means of solution the table,
   graph, formula, and equation
5. Understanding of carefully considered concepts and principles which
   should lead to fundamental skills and techniques
   a. The four fundamental operations involving positive and negative
      numbers, algebraic monomials or simple polynomials, and algebraic
      fractions, mainly monomial denominators
   b. Special products and factoring, such as squaring a binomial, finding
      the product of the sum and the difference of two terms, factoring a
      polynomial containing a common monomial term, factoring trinomials
      of the form x² plus bx plus c, and factoring the difference of two
      squares
   c. Powers and roots involving the laws of exponents and their use,
      square roots of positive numbers, and fundamental operations
      involving radicals of the monomial type
6. The study of relationships and of dependence
   a. Interpreting tables of related number pairs
   b. Making graphs based on tables of related number pairs and using
      graphs in the solution of problems
   c. Using formulas as means of expressing relationship or dependence
   d. Equations involving the solution of equations of the first degree in
      one unknown, fractional equations, equations of the form ax² equals
      b, and simple radical equations
   e. Using equations in the study of proportion and of variables

e. Using equations in the study of proportion and of variables 


The Stanford Achievement Test (Advanced Battery) includes subtests 
which are intended to measure pupils’ progress in several aspects of 
English (6). These are as follows: 

Paragraph meaning (involving some recall, but primarily interpretation) 

Word meaning (involving definitions and, to some extent, reasoning) 

Spelling 


Language (involving capitalization, punctuation, knowledge of sentence
structure, grammar, and usage)


Educational achievement tests are generally used to determine the 
learner's level of proficiency at the time of the examination. The results 
obtained with these devices are also useful in forecasting each individual's 
probable future level and quality of learning in the several subject-matter 
areas, and in diagnosing specific deficiencies and disabilities in school 
learning. The basic importance, therefore, of adequately conceived and
satisfactorily standardized instruments of measurement is readily
apparent.


Resemblance of Test Items to Actual Behavior or Experience 


Inspection of the types of test items already cited, and of those
given in later chapters, shows that the degree of resemblance between the
tasks presented by the items and actual behavior or traits to be discerned
or predicted varies with the several kinds of instruments. Tests of
educational achievement, because they are measures of actual learning in
specific subjects of instruction, utilize items that are samples of acquired
or developed skills. These tests seek to answer such questions as: How
much information has the testee acquired in American history of a given
period? How well can he perform arithmetical processes of a certain level
of difficulty? How much does he know about punctuation or grammatical
usage? In other words, educational achievement tests measure directly
that which they are intended to represent.

This is true, also, of some aptitude tests, as, for example, those in music
that measure the several forms of sensory acuity, those in comprehension
of mechanical principles, and those in law that present cases and problems
of the sort studied in law schools. On the other hand, tests of spatial
perception, speed and accuracy of eye-hand movements are indirect
measures of some aspects of mechanical aptitude, since they are intended to
provide signs or symptoms of a general type of functioning, rather than
direct measures of an activity for which persons are being selected by
means of the test.

Tests of general intelligence, with a few exceptions, are concerned with
the forms, complexity, level of difficulty, quality, and at times, rate of
mental activity, rather than with the specific content. When, for example,
an author of an intelligence test uses problems in similarities and
differences,8 or synonyms-antonyms,9 he is not concerned primarily with the
particular objects or words being compared. What he wishes to determine are
the mental processes involved in reaching a correct answer. When an
author decides to include rote or logical memory in his test, the particular
series of digits or words (whether in a disconnected series or in a
meaningful sentence) are unimportant as long as they meet certain elementary
conditions, such as length and familiarity. What he wants to measure is
“memory span,” or “immediate recall,” since this form of mental
functioning is regarded as significant in the more general aspects of
intelligence.10

The items in a subtest of general information, when included in tests 
of intelligence, are not selected because they are most worth knowing or 
deserving of more attention than others are. They are chosen in order 
to provide a measure of an individual's range of intellectual curiosity and 
activity and of his assimilation and retention of experiences; for these 
aspects are among those symptomatic of intellectual level and quality. 
Items in arithmetical reasoning are included to obtain evidence of com- 
plex reasoning ability with the use of abstractions, rather than to find 
out specifically and primarily the testee's proficiency in arithmetic. 

All the types of subtests incorporated in tests of general intelligence 
should be similarly viewed. The fact that specificity of content is not of 
primary significance can be readily determined by comparing and noting 
the differences among the items of several different instruments, each of 
which is intended to measure the same mental functions and to serve the
same purposes.


8 How are a […] and how are they different?

9 A […] words. The testee may be required to […] or nearly the same, as
the first word. Or he may be […]

11 For example: Do you get upset easily? Do you have headaches frequently?
Are you liked well enough at home so that you feel happy there?

Personality inventories and projective tests present still different forms
of content, so far as correspondence with actuality is concerned.
Inventories consist of verbalizations of a variety of behavioral situations
or of conditions experienced.11 Thus, they are representations of actual
behaviors; but they are as close to actual behavior and experience as can be
approached without observing a person in the behavioral situations
themselves.12 In projective tests, the closeness or remoteness of content
materials varies with each instrument. The pictures presented in the 
Thematic Apperception Test are intentionally ambiguous. They are not 
designed to represent situations that might have been experienced by 
many persons taking the test. For example, one picture shows a young boy 
resting his head on his hands, looking at a violin lying on the table. Many 
of the T.A.T. pictures present situations which, if taken literally, are 
remote from actual experiences of most persons; others are thoroughly 
ambiguous, or even products of fantasy. The purpose of these pictures is 
not to offer situations representative of those actually experienced by 
testees but, rather, to present a variety of situations wherein each indi- 
vidual can impose his own interpretations. As a matter of fact, some of 
the projective tests portray animals rather than persons, the assumption 
being that the testee will more readily respond to animal pictures than to 
those of humans in these situations.13 Then, by contrast, the Michigan
Picture Test, for children, and the Symonds Picture-Study Test, for 
adolescents, represent a number of situations common to these age groups, 
the Michigan being more specific in its representation than the Symonds. 
The purpose of both is to elicit such responses to each situation as will 
reveal attitudes toward, feelings about, and values concerning persons, 
groups, and institutions in the environment. 

The Rorschach Inkblot Test, by contrast with all others thus far 
mentioned, utilizes content materials that are completely unrepresenta- 
tional. Inkblots are ordinarily not encountered in learning situations. 
They provide unfamiliar visual percepts upon which the respondent ex- 
ercises his imagination; and the products of his imagination are analyzed 
for evidence of his personality characteristics.

In discussing and briefly illustrating the degree of similarity of test
content to the actualities of learning, behavior and personality traits, the
purpose is to demonstrate the several methods of obtaining psychological
information, each depending upon the traits or functions to be assessed
and upon the objectives to be served by each instrument.


12 There have been experiments and observations in […] school
rooms, and elsewhere, in which certain conditions are simulated to evoke degrees of
[…] aggressiveness, conformity, leadership, etc. Some of these experiments
have produced valuable results; others have suffered seriously from the fact that
some actual situations are not transferable to a laboratory or other type of
contrived setting.

13 The Children's Apperception Test and the Blacky Pictures (see Chapter 26).

TEST STANDARDIZATION:
PROCEDURES AND RELIABILITY 


The fundamental purpose of standardizing a psychological
test is to establish its reliability and its validity at as high a level as
possible. The techniques of establishing reliability and validity are
discussed later in this and the following chapters. To begin with, the steps
taken in devising a test will be explained.


Content 


The first questions the author of a test must answer are these:
Which abilities, aptitudes, proficiencies, or personality traits are to be
measured?1 Having determined that an instrument is to be constructed
for the measurement or assessment of certain specified functions,
behaviors, or traits, the author must then make his own analysis of their
constituent elements or utilize analyses previously made by others and
reported in the scientific literature on the subject. Examples of such
analyses were given in the preceding chapter.2 It often happens that
specific job analyses must be made preparatory to devising a proficiency test.
In other instances, considerable exploratory research, based
upon psychological insights, must be carried out before one can
identify, even tentatively, the elements that should be included.

1 The fact is, of course, that the author of the contemplated test has answered this
question for himself at the outset, having decided that an instrument was needed for
one purpose or another.

2 Methods of determining which elements are the most appropriate for inclusion are
discussed further in later chapters in connection with definitions and theories of
intelligence, the nature of specific aptitudes, aspects of personality, and the rationales of
specific types of tests.

The number and kinds of subtests and the nature of the individua 
items themselves will be determined in large part by the test author's 
theoretical position. For example, in devising a test of “general ability,”
his position with regard to the general factor will influence the content
and structure of the test (see Chapter 7). The author's views regarding
cultural influences will be another consideration (see, for example, the
Davis-Eells test, Chapter 15). In devising a test of personality, especially
the projectives, one’s theory will influence the selection of traits to be
assessed and the manner of assessing them (see Chapter 23). Testing for
proficiency in specified jobs is the area in which theoretical differences
play a minor role, if any, because both the available tests and the new 
ones being devised will be based upon a job analysis of the activities 
actually involved. 

After he has decided which areas are to be tested, the author must
prepare a large number of items to be used in the initial, or exploratory,
stage of test construction. It might be necessary to begin with three or
four times the number of test items that will be included in the final
product. The exploratory and second stages of standardization will show
that some items are of little or no value for the purposes of the test, while
others will be eliminated from the final version because they are
superfluous even though, in themselves, they might be relevant to the purpose.
The actual kinds of items utilized will depend upon the nature and
purpose of the instrument. Shall they be direct measures, as in the case
of achievement tests? Or shall they primarily involve mental processes
whose actual content is a secondary consideration, as in tests of general
intelligence? Or shall they sample behavior, as do personality inventories?
Or shall they evoke “signs” of personality traits, as do projective tests?

After areas, age ranges, levels of difficulty, or kinds of traits to be
studied are determined, the initial set of test items will be tried out on
[…] groups to formulate and clarify instructions, to
[…] phrasing, to learn whether each item evokes the […],
and to decide on time limits. It is also valuable
[…] introspective reports regarding the whole test
and particular items in it.

[…] went through seven versions before the satisfactory form was
constructed.) Results obtained with the preliminary form will provide
statistical data about the test as a whole and about each individual item. On
the basis of these data, analyses can be made to determine the test's
reliability and validity at this stage.
Other valuable statistical findings are range of individual differences
among scores, norms and deviations of scores at each of the several ages
or ability levels, difficulty level of each test item, and the degree to which
each item or subtest discriminates between those who score high and those
who score low on the test as a whole.4 The data provided by this process
will be used to prepare a revised version for further trial with another
large stratified sample of the group.5 For example, the Lorge-Thorndike
test manual states: “The initial series of tryouts involved more than 6000
pupils. Each of 1200 different test items was tried out on 650 to 1000 of
these pupils. Items were selected to yield an appropriate distribution of
difficulties and to include only items with acceptably high internal-
consistency indices” (17).6 The final norms are based upon results found
with 196,000 pupils in forty-four communities in twenty-two states.
From this point on, the process is one of selection, rejection, addition,
and refinement of items, and their placement in the scale in respect to
difficulty level.7 The ultimate purpose of this entire process of trial and
improvement is to develop an instrument which will yield the maximum
results in either assessment or prediction, or both, using a minimum
number of test items.
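
The two item statistics named above, difficulty level and discrimination between high and low scorers, can be illustrated with a short computation. What follows is only a sketch in Python, not the procedure of any particular test manual. It assumes that each response is scored 1 (pass) or 0 (fail), takes difficulty as the proportion of testees passing an item, and takes discrimination as the difference in pass rates between the highest-scoring and lowest-scoring groups on the test as a whole. The function names and the 27 per cent group size are assumptions introduced for the illustration.

```python
def item_difficulty(responses):
    """Proportion of testees passing each item.

    `responses` is a list of rows, one row per testee, each row a list of
    0/1 item scores (an assumed data layout, for illustration only).
    """
    n_testees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_testees for j in range(n_items)]


def item_discrimination(responses, fraction=0.27):
    """Pass rate in the high-scoring group minus pass rate in the low-scoring group.

    Groups are formed from total scores on the whole test. The default
    fraction of 0.27 is a common convention, assumed here rather than
    taken from the text.
    """
    totals = [sum(row) for row in responses]
    order = sorted(range(len(responses)), key=lambda i: totals[i])
    k = max(1, round(len(responses) * fraction))
    low = [responses[i] for i in order[:k]]
    high = [responses[i] for i in order[-k:]]
    n_items = len(responses[0])
    return [
        sum(row[j] for row in high) / k - sum(row[j] for row in low) / k
        for j in range(n_items)
    ]
```

An item passed by nearly everyone or by nearly no one, or one whose discrimination value is close to zero, would be a candidate for rejection or revision in the next trial version.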


Population 


At the outset, too, the author of the test must define the group
for whom the test is intended. For example, in devising measures of
general intelligence or of "differential aptitudes" (specific areas of ability,
such as facility with numbers or with words and spatial perception),
the age range must be determined. Shall the test provide a scale
appropriate for a wide range, as the Stanford-Binet does (2 years through
adulthood); or for a moderate range, as does the Chicago Test of Primary
Mental Abilities (ages 11 through 17); or for a narrow range as does
the Cattell scale for infants and young children (ages two months to
thirty months)? The age groups must then be subdivided according to
the principle of stratified sampling, as explained in Chapter 2.

4 Reliability, validity, and item analysis are discussed in later sections.

5 The term “large sample” does not indicate a fixed number. About 500 testees in the
initial exploratory stages are desirable, if available. Much larger numbers are necessary
for preliminary and final validation.

6 Internal consistency is discussed under “Meaning of Reliability,” later in this chapter.

7 Determining difficulty of items and placing them in the scale accordingly is called
scaling.

In the construction of tests of personality, age level is an important 
element. In tests of educational achievement, however, grade range or 
educational level, rather than age level, determines the difficulty and range 
of materials. This is the case also in tests of aptitudes or of proficiencies, 
where age range is less often a consideration than is the level of pro- 
ficiency (or degree of skill) to be tested. In any event, the universe for 
whom the test is intended, and its subdivisions (strata), if any, must be 
determined. Then it is necessary to obtain a large enough sampling of
this universe to try the test items at each of the several stages of its
development.
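
As a rough illustration of dividing a tryout sample among strata, the sketch below uses proportional allocation, in which each stratum receives a share of the sample equal to its share of the universe. This is one common form of stratified sampling and is assumed here for the example; it is not necessarily the exact procedure explained in Chapter 2. The stratum labels and counts in the usage note are hypothetical.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Divide `sample_size` testees among strata in proportion to stratum size.

    `strata_sizes` maps a stratum label to the number of persons in that
    stratum of the universe. Leftover places from rounding down are given
    to the strata with the largest fractional remainders.
    """
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total for name, size in strata_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation
```

With a hypothetical universe of 3,000, 2,500, 2,500, and 2,000 pupils at four age levels, a tryout sample of 500 would be divided into groups of 150, 125, 125, and 100.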


Reliability 


Meaning of Reliability. The two essential characteristics of a
sound test are its reliability and its validity.
Whenever anything is measured, whether in the physical, biological, or
behavioral sciences, there is some possibility of chance error. This is true
of psychological tests as well. Variations of results obtained with the same
test administered more than once, using the same persons as subjects, or
within the parts of a test given only once, are due not only to chance
factors, which should be eliminated so far as possible, but also to actual
differences among the individuals taking the test and to whatever defects
may be inherent in the instrument itself.

The reliability of a test is its ability to yield consistent results from
one set of measures to another; it is the extent to which the obtained test
scores are free from such internal defects as will produce errors of
measurement inherent in the items and their standardization. These errors are
not due to instability of the performances of the testees themselves or
to chance factors. However, since individuals do not perform with
complete consistency upon all occasions, and since chance cannot be entirely
eliminated, the actual indexes of reliability obtained for psychological
tests are the product of the interaction among true individual
differences, defects of the instrument, and chance determinants.


The term reliability has two closely related but somewhat different
connotations in psychological testing. First, it refers to the extent to
which a test is internally consistent, that is, consistency of results
obtained throughout the test when administered once. In other words, how
accurately is the test measuring at a particular time?

Second, reliability refers to the extent to which a measuring device
yields consistent results upon testing and retesting. That is, how
dependable is it for predictive purposes? Obviously, if a test does not have a high
degree of reliability when used more than once, it can have but limited
value in predicting an individual's future performance or level of
development.
The two aspects of reliability are intimately related; for if a test is not
highly reliable when used upon a particular occasion (that is, internally
consistent), it can have little predictive value.8 Since one of the principal
uses of psychological tests is for prediction and planning for the
subsequent development and performance of individuals, a high degree of
reliability is a sine qua non of a sound instrument. This statement
applies also to tests that are to be used for research purposes, since the
results obtained will not be any sounder than the research instrument
itself.

Conditions Affecting Reliability. Reliability is not an all-or-none
principle; it is a matter of degree. No test presently available is in
itself perfectly reliable; scores for the same individuals, obtained on
repeated testings, are not completely stable. Not only are there likely to be
some different chance determinants in operation at different times, but it 
is quite normal for humans to vary in performance, generally within 
fairly narrow limits, from one occasion to another. Such variation is ex- 
pected quite aside from changes that occur as part of the process of 
growth and development. 

There are several possible sources of variation in performances on a test. 
This aspect of testing will recur frequently in subsequent discussions; but 
for the present, the most common of them may be listed as follows: 

Actual, or “true,” differences among individuals in the general traits or
general abilities being measured

Differences in specific abilities required in a particular test; or specific
disabilities in the functions being tested; for example, reading, manual
dexterity

Skill in taking tests; being “test wise,” or the converse

The "chance" acquisition of a particular piece of knowledge required in
a test, for example, the meaning of an unusual word, such as ambergris,
or unusual information, such as the name of the author of a little-known
work (These would be poor test items.)

Effects of practice (previous test-taking) or coaching

Normal or expected fluctuations in performance from time to time

Personal characteristics of the testee: fluctuations of attention, motivation,
health, energy level, emotional status

Physical conditions under which the test is taken: heat, light, ventilation

Unpredictable, or chance factors: noise, interference, broken pencil,
misunderstanding of instructions

Fortunate guessing of answers

Competence of examiners and their agreement in scoring

8 In some individual cases, internal inconsistency is due to instability of, or
inconsistencies within, the testee, rather than to defects of the test itself. This aspect of test
interpretation is discussed in Chapter 14.


Test results, ideally, should depend upon the extent to which the test 
measures the first two of these sources of variation; actually, however, co- 
efficients of reliability will be adversely affected by the nonsystematic 
operations of the others. 

Standardization of a test aims to eliminate or reduce to a minimum its 
inherent defects. The conditions of testing and retesting should be as 
nearly optimal and consistent as possible; for reliability is in part a 
consequence of testing conditions, including strict adherence to pre- 
scribed instructions for administering, utilization of practice exercises 
(when included), accurate timing, elimination of noise, and general
provisions for adaptation of individuals to the testing situation. Though
minor fluctuations in an individual's performance from day to day cannot
be eliminated, the reasons for any major fluctuations must be sought
in the individual himself or in his environmental conditions, if the 
sources of serious discrepancies are to be understood. 

Methods of Estimating Reliability. Methods of estimating
reliability fall into two general classifications: (1) relative reliability and
(2) absolute reliability. The first of these is generally stated in terms of a
coefficient of correlation, known as the reliability coefficient. This statistic
indicates the extent to which individuals in a group maintain relatively
consistent positions (scores) when two sets of measures are obtained and
correlated, using the same test or its two equivalent forms. Relative
reliability is also reported at times, though infrequently, in terms of analysis
of variance. The second method, absolute reliability, is stated in terms of
the standard error of measurement, which is an estimate of the deviation
of a set of obtained scores from their “true scores.”

Several methods are used to derive the reliability coefficient. These are
as follows:

(1) The same form of the test may be administered twice to the same
group of individuals.

(2) Two separate but equivalent forms of the test may be administered
to the same individuals.

(3) The test items of a single test are subdivided into two presumably
equivalent and separately scored sets; the two sets of scores are
correlated as though they were obtained from two equivalent forms or from
two testings with the same form.
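
The reliability coefficient and the standard error of measurement described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the computing routine of any particular test: the reliability coefficient is taken as the product-moment (Pearson) correlation between two sets of scores for the same individuals, the split of a single test into two sets is made by alternating items (an assumed way of forming presumably equivalent halves), and the standard error of measurement uses the conventional formula, the standard deviation of the scores multiplied by the square root of one minus the reliability coefficient. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(first_scores, second_scores):
    """Product-moment correlation between two sets of scores for the same persons."""
    mean_x, mean_y = mean(first_scores), mean(second_scores)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(first_scores, second_scores))
    sum_sq_x = sum((x - mean_x) ** 2 for x in first_scores)
    sum_sq_y = sum((y - mean_y) ** 2 for y in second_scores)
    return cross / sqrt(sum_sq_x * sum_sq_y)


def split_halves(item_scores):
    """Score two sets from a single testing by taking alternate items.

    `item_scores` holds one row of item scores per testee. Alternating
    items is an assumption made for this sketch.
    """
    first_set = [sum(row[0::2]) for row in item_scores]
    second_set = [sum(row[1::2]) for row in item_scores]
    return first_set, second_set


def standard_error_of_measurement(scores, reliability):
    """Conventional formula: SD of the scores times sqrt(1 - reliability).

    The population standard deviation is used here for simplicity.
    """
    return pstdev(scores) * sqrt(1 - reliability)
```

For methods (1) and (2), `pearson_r` would be applied to the scores from the two testings or the two forms; for method (3), it would be applied to the two sets returned by `split_halves`.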
RETEST RELIABILITY USING A SINGLE FORM. When individuals are
retested a number of times, they might undergo some changes as a 
result of repeated measurements, for example, because of practice effects, 
improvement in the skill of taking tests, and in their attitudes toward 
taking a test. In estimating reliability, therefore, it is necessary to limit 
the number of times an individual is examined with the same device. 
Hence, instead of frequent retesting of the same persons, data for a given 
test are obtained by increasing the number of individuals tested rather 
than by increasing the number of measures of each person. 
Administering the identical test form twice has the obvious advantage of 
providing completely equivalent test content on both occasions. This is 
an essential consideration. Furthermore, it is, of course, less difficult to 
develop one form of a test rather than two.
However, this method also has its disadvantages. It is time-consuming. 
The experience of having taken the test the first time might result in some 
learning or improved skills, so that individuals on the second occasion are 
no longer “the same” in all respects as they were on the first. The first and 
second testings should take place within a week or two, in order to 
minimize the possible influences of intervening factors of developmental 
and chance changes. Even so, some psychologists hold that the brief
interval of a week or two does not sufficiently reduce the possible effects of
recall.

Yet, although the reliability coefficient might be somewhat too high
when the same form of the test is used, the probable influence or
importance of recall is not nearly so great as might be expected. In the first
place, the number of test items in both individual and group tests is so 
large that it is extremely difficult to recall a significant number of them, 
especially when the persons taking the test are working under pressure 
and must rapidly shift attention from one problem to another. 


9 Currently, some psychologists prefer to use different terms to designate the several
methods of estimating test reliability. Instead of speaking of the reliability coefficient and 
of retest reliability when using a single form, some prefer the term “temporal stability,” 
while others prefer “coefficient of stability.” When two equivalent forms of a test are 
used, or when a single form is divided into two equivalent halves, some prefer the term 
"coefficient of equivalence." Because the older term, "reliability coefficient," is the generic 
one and because, so far as the test itself is concerned, we are interested in its de- 
pendability but not in the influences of time as such, we shall use the older term while 
explaining the different methods. The equivalence of two forms or parts is a precondi- 
tion of the test's reliability. Furthermore, the word "stability" is a different way of 
saying "reliability," but with a somewhat different emphasis. "Reliability," however
estimated, is the preferable word, for we want to know how dependable test results are.

The results of two studies will illustrate this point. Thirty children,
between the ages of 9 and 11, with IQs ranging from 96 to 114, were
examined twice each with the 1937 Stanford-Binet scale, Form L. The
tests were given a week apart to each child by two experienced examiners,
each child being examined once by each of the psychologists. Upon
completing the first examination, each child was told the correct answer or
shown the correct solution to every item he failed, up through what is
known as the "terminal year."10 The purpose of this procedure was to
learn how much would be retained from the first test and influence the
results of the second. The correlation coefficient for the two sets of
intelligence quotients was .89; the mean IQ increased by two points (from
103 to 105); and the total range was only slightly changed (92-117).

A second study was made with the Pintner Non-language Intermediate 
Test, using sixty-eight pupils in the sixth grade, ranging in IQ from 76 
to 95. After the first testing had been completed and the booklets col- 
lected, the tester selected, for demonstration and explanation, two or three 
items from each of the subtests, representing the several levels of difficulty. 
For each of these items, the correct answer and the reason for it were 
given. The pupils were encouraged, also, to ask for the answers to any 
items they recalled and were doubtful about. Only a few questions were 
raised. A week after the initial test, the same groups were reexamined 
with the same form. The correlation coefficient for the two sets of
intelligence quotients was .85.11 Numerous investigations have demonstrated
that, in general, longer intervals between repeated tests will result in 
lowering the reliability coefficient; that is, reliability is in part a function 
of the time elapsed. The following correlation coefficients are representa- 
tive of those found for the stated intervals: 


Immediate retesting (same day or next)    .90-.95
Retesting after 1 year                    .85
Retesting after 2-2½ years                .80
Retesting after 5 years                   .75-.80
Retesting after 9 years                   .78

These are rather conservative indexes; some studies find closer
correspondence of scores after appreciable intervals. For example, one study
reports the following (7):

Age interval    r, boys    r, girls
6-16            .71        .74
7-10            .79        .76
8-10            .89        .83
9-14            .82 (both sexes)

10 Terminal year is the age level on the Stanford-Binet Scale at which the testee fails all
items.

11 Both of the instances cited are un[…] made a practice of asking […]
to try to recall specific items […] has any student been able to do so.
[…] learning, recall, interference, […]


In still another research, the following results were obtained with the
use of the Stanford-Binet scale (4).

Age at initial testing    Interval    r
4 years                   6 years     .73
9 years                   6 years     .87
11 years                  6 years     .92


There are several reasons for the lowering of reliability coefficients as
the time interval increases. If children are initially tested before 3 years
of age, and retested after several years, the reliability coefficient will be
misleadingly low because different functions and dissimilar mental
operations are being sampled at the two age levels.12 This fact, though true in a
lesser degree when the first tests are given to children between the ages
of 4 and 6, may account in part for the coefficient of .73 reported above for
the group initially tested at the age of 4 with the Stanford-Binet scale.
Another reason is that individual rates of mental growth and forms of
mental development do not always progress uniformly, so that the greater
the time interval, the greater will be the opportunities for idiosyncrasies
to affect test results. Furthermore, it is more difficult to motivate and hold
the attention of young children. Examiner reliability is also a factor. If
entirely competent examiners are used for both tests, as they should be,
the discrepancies in scores will be minor. Numerous comparisons made
in regard to this question have found correlations in the vicinity of
.85-.90 for the same tests scored by two examiners, or for two scores
derived by them from two separate examinations.

Reliability coefficients are affected, of course, by any defects or deficien- 
cies inherent in the test itself in respect to content and scaling, with 
their consequent effects upon the range of scores at each of the several 
age levels. The question here is this: Are the standard deviations of the 
intelligence quotients the same, or nearly so, at all age levels, so that an 
individual who maintains his relative position in the group, over a 
period of time, will have the same IQ rating at all ages? If the disper- 
sions of scores vary appreciably at different age levels, as they do with 
some tests, the reliability coefficients will be adversely affected.18 

It is evident, then, that the reliability coefficient obtained by means of 


m is presented more fully in Chapter 13. 
m is discussed in Chapters 6 and 10. 


the time interva 
to affect test results. Furt 


12 This proble! 
1 This proble: 
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retesting with the same form has its advantages and disadvantages. It 
will depend upon the quality of the instrument, the competence of the 
examiner, the characteristics of the individuals used in the study, and 
the circumstances under which the tests are given. Some of these aspects 
will be discussed in later sections of this chapter. 

RETEST RELIABILITY USING EQUIVALENT FORMS.14 Estimating reliability
by means of this method presents some of the same advantages and
disadvantages as does the use of only one form. But using equivalent forms
also has an advantage of its own: the possible effects of specific practice
and recall are lessened, since the items in the two versions of the testing
device are not identical. On the other hand, this method presents an
additional problem in the construction and standardization of the second
form, that is, the problem of making both forms truly equivalent. This
means that both forms should meet all of the test's specifications as
follows:


1. The number of items should be the same.
2. The kinds of items in both should be uniform in respect to content,
operations or traits involved, levels and range of difficulty, and adequacy
of sampling.
3. The items should be similarly distributed as to difficulty.
4. Both test forms should have the same degree of item homogeneity in
the operations or traits being measured. The degree of homogeneity may
be shown by intercorrelations of each item with subtest scores, or with
total-test scores.
5. The means and the standard deviations of both forms should correspond
closely.
6. The mechanics of administering and scoring should be uniform.


but complete uni- 
sary, however, that 


i ical of i i 
at every age group (18, pu. ypical of the relationship found 
the possible psychological disadvantages already discussed. Another
technique has, therefore, been devised and is widely used.

SPLIT-HALF RELIABILITY. By means of this technique, which is used
to find internal consistency, the items in a whole test are divided into
two halves which should be equivalent or very nearly so. Thus, two scores
(one for each half-test) are obtained for each individual by administering
the test only once; and each score is treated as though it represented
a separate form.

A prerequisite for using the split-half technique is that the items shall
have been arranged in their order of increasing difficulty as determined
by the percentage of individuals in the standardization group who have
passed each (scaling). This fact rules out what would seem to be the first
and obvious method; namely, taking the first half of the number of items,
then the second half, from each of which a separate score is to be derived.
It is obvious that these two parts would not be comparable in levels
of difficulty, effects of cumulative fatigue, increasing or decreasing
confidence, and external chance factors. In using the split-half method, the
common practice is to make the division by taking the odd-numbered
items as one part and the even-numbered as the other. The score is found
for each individual for each half and the two sets of paired scores are
then correlated.15

Some psychologists have suggested that each half of the whole be
treated as a short form equivalent to the other half, and that they be
administered on two different occasions, but within a short time of each
other. This practice, of course, requires less labor compared with that
needed to construct two full-length equivalent instruments; but it suffers
from some of the disadvantages of giving two full-length alternates. The
use of this method would be justified if the whole test were too long for
a single occasion and, as a result, produced excessive fatigue in its later
sections.

Selecting odd-numbered items as one half of the test and even-numbered
items as the other half is justified on these grounds: items in most tests
(as will be seen) are grouped according to type (number sequences,
vocabulary, etc.) and are arranged according to difficulty, from easiest to most
difficult. Thus, when this systematic arrangement is employed, the odd-even
procedure yields close approximations to equivalent half-scores,
because each half-score is based upon the same types of items and the
same number of each type; and each half-score is based upon items that
progress in difficulty in approximately the same degree. Consider the first
ten items of a single type (known as a subtest), verbal analogies, for
example. Numbers 1, 3, 5, 7, 9, as a group, are about as difficult as
numbers 2, 4, 6, 8, 10, if they are graduated in difficulty from 1 to 10; for
both the odd-numbered and the even-numbered include items from
practically the entire range of difficulty represented from numbers 1 to 10.16

15 This method is also known as odd-even reliability.

Since the correlation coefficient for the two sets of scores derived by 
this method is based upon subdivisions of the full test, each of which is 
half the length of the whole, a statistical formula (Spearman-Brown) is 
used to correct for the reduced lengths of the subdivisions from which 
the correlated scores have been determined. The reason for this correction 
is that the score of the whole test, being based upon a larger number of 
items, is a more adequate sampling of traits or functions and hence re- 
duces the possible effects of chance solutions and accidental errors. The 
whole test is thus more reliable than its subdivisions; and the correction 
formula is intended to indicate what the reliability of the entire test 
would be, based upon what was found with the part scores. 
An example will demonstrate how the Spearman-Brown formula
operates. The generalized formula is:

$$ r_n = \frac{nr}{1 + (n - 1)r} $$

in which r is the coefficient of reliability obtained between the parts of
the divided test; $r_n$ is the reliability of the test n times as long as half
the original test.

In the method of odd-even reliability, n is 2, since the original test
has been divided in two equal parts. Assuming, then, that the odd-even
coefficient r is .80, and substituting the values in the formula
(2 x .80 divided by 1 + .80, or 1.60/1.80), the reliability of the whole
test $r_n$ is found to be .89. This estimated reliability
coefficient for the test as a whole is the one usually reported in
psychological research and in test manuals.
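
The odd-even procedure and the Spearman-Brown correction described above can
be sketched in a few lines of Python. This is only an illustration under stated
assumptions: the function names and the six-examinee, six-item score matrix are
hypothetical and are not taken from any test discussed in the text.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    return cross / math.sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))


def spearman_brown(r, n=2):
    """Estimated reliability of a test n times as long as the part that gave r."""
    return n * r / (1 + (n - 1) * r)


def odd_even_reliability(item_scores):
    """Return (half-test r, corrected whole-test r) for rows of 0/1 item scores."""
    odd = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return r_half, spearman_brown(r_half, 2)


# Hypothetical data: six examinees, six items ordered from easiest to hardest
# (1 = passed, 0 = failed).
HYPOTHETICAL_ITEMS = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

# The example from the text: an odd-even coefficient of .80 corrected to
# full length.
print(round(spearman_brown(0.80), 2))  # 0.89

r_half, r_whole = odd_even_reliability(HYPOTHETICAL_ITEMS)
print(round(r_half, 3), round(r_whole, 3))
```

Calling `spearman_brown(r, 3)` or `spearman_brown(r, 0.5)` gives the estimate
for a test lengthened threefold or cut in half, under the formula's assumption
that the added or removed items are comparable to the existing ones.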

The Spearman-Brown formula may be used to estimate the effect upon
reliability of a test of a given length if it should be increased by any
multiple (3 or 4 times, for instance) or decreased by any fraction (1/2
or 1/4). There is a point of diminishing returns, so to speak, beyond which
the very small increase in the reliability coefficient, resulting from increase
in length, does not warrant the extension of a test (see Fig. 4.1). Figure
4.2 illustrates increase in test reliability as the length of a test is doubled.
This figure demonstrates what happens when reliability is calculated by
the split-half method and then corrected by the Spearman-Brown formula.

Another formula that yields nearly the same results as the
Spearman-Brown is available for estimating reliability by the split-half
method. It is the following:

$$ r = 2\left(1 - \frac{SD_a^2 + SD_b^2}{SD_t^2}\right) $$

in which $SD_a$ and $SD_b$ are the standard deviations of the scores of the
half-tests, and $SD_t$ is the standard deviation of the scores of the whole
test. The obvious advantage of this formula is that it is not necessary to
correlate the half-tests themselves.
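
A minimal sketch of a variance-based split-half estimate follows, assuming the
formula has the form r = 2(1 - (SD_a^2 + SD_b^2)/SD_t^2), in which only the
three standard deviations are needed. The function name and the half-test
scores are hypothetical illustrations.

```python
from statistics import pvariance


def split_half_from_variances(half_a, half_b):
    """Split-half reliability from the variances of the halves and the whole.

    half_a, half_b: each examinee's score on the two half-tests, in the
    same examinee order. No correlation between the halves is computed.
    """
    totals = [a + b for a, b in zip(half_a, half_b)]
    part_variance = pvariance(half_a) + pvariance(half_b)
    return 2 * (1 - part_variance / pvariance(totals))


# Hypothetical odd-item and even-item scores for six examinees.
HYPOTHETICAL_ODD = [3, 3, 2, 2, 1, 1]
HYPOTHETICAL_EVEN = [3, 2, 2, 1, 1, 0]

print(round(split_half_from_variances(HYPOTHETICAL_ODD, HYPOTHETICAL_EVEN), 3))
```

For these hypothetical scores the half-tests correlate about .85, which the
Spearman-Brown correction raises to about .92; the variance formula gives
about .91. The two estimates coincide only when the half-tests have equal
standard deviations.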

The split-half method of determining reliability can, under some
circumstances, yield a coefficient that is somewhat higher than is warranted.
In estimating reliability, an assumption is that the effects of chance
factors are uncorrelated and, therefore, will cancel out one another. In
using the split-half method, however, because both measures are obtained
on a single occasion, fluctuations caused by temporary conditions within
the testee or by conditions in the external environment will operate in
the same way on both halves, favorably or unfavorably, thus producing a
higher coefficient than would be found otherwise.
FIG. 4.2. Showing the increase in reliability of the whole-test scores as a
function of the reliability of the half-test scores, when the Spearman-Brown
formula is applied


The split-half technique might also yield higher reliability coefficients
for predictive purposes than those found by retesting. The reason for this
is that the former are not affected by the ordinary, day-to-day
circumstances that cause normal fluctuations in performance.

In particular, the split-half method should not be used in estimating
reliability of a pure "speed test," whose items are of the same degree of
difficulty throughout and which, therefore, measures only rate of
performance at the given level of difficulty. Since all items in the test are of equal
difficulty, an examinee should do as well with any one item as with any
other. Hence, to measure rates of performance and to differentiate among
individuals, the time limit and the length of the test should be such that
no one is able to complete all the items. Under the circumstances, except
for chance errors in performance, the odd-even correlation should be
+1.00 (perfect positive), because the test is, presumably, uniform
throughout, and the psychological function being measured (speed) is operating
uniformly on all items. It follows that the total scores on the
odd-numbered items should equal those on the even-numbered items. One
test manual, for example, reports an odd-even reliability coefficient of
.99+, but the manual also reports a coefficient of .88 when the scores of
two equivalent forms were used (5). The best procedure is to use the
test-retest method with a "speed test."

Reliability of Homogeneous Items. Another method of estimating
reliability of a test, based on a single administration (internal
consistency), employs the Kuder-Richardson formula, named for its
originators (15). The appropriate use of this method requires that all items in
the test should be psychologically homogeneous; that is, every item
should measure the same factor or a combination of factors in the same
proportion as every other item. This is called "interitem consistency."
A characteristic of interitem consistency is that every item in the test
has a high correlation with every other one. As in the case of the
split-half reliability method, this procedure is not affected by an individual's
variations in performance from time to time. Again, like the split-half
technique, it should not be used with speed tests. But if the items in the
test are not highly homogeneous, this method will yield a lower reliability
coefficient than does the split-half method.17 The commonly used formula
is the following:

$$ r = \left(\frac{n}{n - 1}\right)\left(\frac{SD^2 - \sum pq}{SD^2}\right) $$

in which r is the reliability coefficient of the whole test
n is the total number of test items
SD is the standard deviation of the scores of the whole test;
SD^2 is, of course, the variance of the total test
p is the proportion of individuals passing each item (the percent
expressed as a decimal fraction)
q is the proportion of individuals not passing each item (q = 1 - p)
Σpq is the sum of the products of pq (the product is found
separately for each item).
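
The computation can be sketched as follows, assuming the formula takes the
form r = (n/(n - 1))((SD^2 - Σpq)/SD^2) with p and q expressed as proportions
and the variance computed with the number of examinees as divisor. The
function name and the score matrix are hypothetical.

```python
from statistics import pvariance


def kuder_richardson(item_scores):
    """Kuder-Richardson reliability for rows of 0/1 item scores (one row per person)."""
    n_people = len(item_scores)
    n_items = len(item_scores[0])

    # Sum of p*q over items, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_scores) / n_people
        sum_pq += p * (1 - p)

    # Variance (SD squared) of the total scores of the whole test.
    variance = pvariance([sum(person) for person in item_scores])
    return (n_items / (n_items - 1)) * ((variance - sum_pq) / variance)


# Hypothetical data: six examinees, six items (1 = passed, 0 = failed).
HYPOTHETICAL_ITEMS = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
]

print(round(kuder_richardson(HYPOTHETICAL_ITEMS), 2))  # 0.8
```

Because the items in this hypothetical matrix differ widely in difficulty, the
coefficient (.80) falls below the corrected odd-even coefficient for the same
data (about .92), in line with the remark above about tests whose items are not
highly homogeneous.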


THE STANDARD ERROR OF MEASUREMENT. This index of reliability is
an estimate of the deviation of a set of obtained scores from their "true"
scores. A "true" score is a value that is free from any chance factors and
other errors of measurement. An individual's true level presumably
remains constant from one measurement to another, but his obtained scores
may vary to some extent from time to time. Theoretically, the true
score represents an individual's true level of performance on the test
being used. The true level of any individual would be represented by the
average score (mean) of an unlimited number of measurements of that
person, obtained with the measuring device in question, assuming that
repeated testings have not changed the person. Since the standard error
of measurement in psychological testing cannot be based upon a large
number of repeated measurements of the same individuals because of
practical difficulties, it is necessary to substitute a large number of
individuals for whom only two sets of measures need be obtained. Thereby,
the limits of the most probable true scores are found.18

17 Kuder and Richardson have presented other formulas as well as the one
given here. Other statisticians, also, have devised modifications and variants.
The student who wants to study this technique should consult one of the
textbooks on statistical methods used in psychology.

The standard error of measurement is dependent upon the standard
deviation of the distribution of obtained scores and upon the coefficient
of reliability of the test from which the distribution of scores was
obtained. The formula for determining the standard error of measurement
is written:

$$ SE\ (\text{meas.}) = SD_t \sqrt{1 - r_{tt}} $$

in which $SD_t$ is the standard deviation of the distribution of the
obtained scores, and $r_{tt}$ is the reliability coefficient of the test.

Assume that the standard deviation of a test ($SD_t$) is 12 IQ points and
that its coefficient of reliability ($r_{tt}$) is .90. Substituting these values in
the formula, we find that the standard error of measurement is
approximately ±3.8 points. This statistic is interpreted as follows: assuming
that the test scores are normally distributed and that the "errors of
measurement" are similarly distributed, then approximately 68 percent of the
obtained scores are within ±3.8 points of the true scores for the person
measured. Otherwise stated, the odds are 68 out of 100 (or 68 to 32) that
a particular individual's obtained score is in error by 3.8 points or less.
Then, using the table of probabilities for standard deviation values, we
can say, further, that the probabilities are 19 to 1 (95 in 100) that the
error of measurement will be 7.6 points (twice the standard error of
measurement), or less; and 99 to 1 that it will be 9.5 points (2 1/2 times the
standard error of measurement), or less.
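
The arithmetic of this example can be checked with a short sketch. It assumes,
as the text does, that errors of measurement are normally distributed; the
function names are hypothetical.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """SD of obtained scores times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def chance_within(k):
    """Probability that a normally distributed error lies within k standard errors."""
    return 2 * NormalDist().cdf(k) - 1


# The example from the text: SD = 12 IQ points, reliability = .90.
sem = standard_error_of_measurement(12, 0.90)
for k in (1, 2, 2.5):
    print(f"{k} x SEM = {k * sem:.1f} points; chance within = {chance_within(k):.3f}")
```

The three limits come out at 3.8, 7.6, and 9.5 points. The normal-curve
probabilities are roughly .68, .95, and .99, which is what the rounded odds of
68 in 100, 19 to 1, and 99 to 1 express.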

Obviously, the higher the test's reliability coefficient, the smaller will
be the error of measurement, and therefore the greater the dependability
and the predictive value of the instrument. The standard error of
measurement also provides a basis for estimating the probabilities that unequal
scores for two or more individuals represent a true statistical difference,
or whether the scores fall within a narrow enough range to be regarded
as probable deviations from the same, or nearly the same, true scores.

18 The true score as defined is a statistical concept. Whether or not it is
consistent with a psychological conception of an individual's true score is a
different matter. Assume, for example, that a pupil is capable of performance
at, and does occasionally attain, an IQ level of 140. But for various
nonintellectual reasons, his intelligence quotients, obtained by means of a
large enough number of testings with the same instrument, average only 130.
Psychologically, is his true score a 140 IQ, the level he is really capable of
reaching, or is it 130 IQ, which represents only his average level of
functioning? The answer is a matter of definitions; but to the clinician and
the guidance counselor, the difference is an important consideration. The
student should remember that what is statistically "true" may not be the
psychological answer; it may be an artifact.

Quite aside from any question of statistical significance, a psychologist 
who is experienced in administering, scoring, and interpreting tests— 
especially those individually administered, as is the Stanford-Binet— 
knows that no psychological significance attaches to a difference of a few 
points in intelligence quotients or percentile ranks. 

As estimates of a test's dependability, the reliability coefficient and the 
standard error of measurement supplement each other. Since the size 
of the latter depends in part upon the former, the coefficient must always 
be found; it is regarded as an essential index of an instrument's value. 
While not all test manuals report standard errors of measurement, it too 
is now regarded as essential. Test authors are expected to include it; for 
the higher the reliability coefficient and the smaller the error of measure- 
ment, the greater will be the confidence attached to judgments and pre- 
dictions based on the test's findings, provided, also, that the instrument is 
sufficiently valid. 

If the error of measurement varies at different age levels or ability 
levels, the manual should report this fact, because sometimes a test meas- 
ures with less error at some levels of the scale than at others. In that 
event, it is possible to estimate at which levels the standard errors of 
measurement are larger or smaller than that for the entire group. On 
the 1937 Stanford-Binet scale, for example, this index is 5.2 points for
IQs above 130, but only 2.2 points for IQs below 70.

When this index is given in terms of test units (raw or weighted scores),
it is not possible to compare directly the reliabilities of two or more such 
tests, because their units are rarely comparable. The coefficient of reli- 
ability, therefore, is the index used in making intertest comparisons. 
When the same derived index, as found by two or more different devices, 
is used (for example, the intelligence quotient), then their standard errors 
of measurement may be regarded as one basis of comparison.

Table 4.1 illustrates the extent to which the standard error of
measurement depends upon the reliability coefficient and upon the standard
deviation of obtained scores found for the group of persons tested.

TABLE 4.1

STANDARD ERRORS OF MEASUREMENT FOR DIFFERENT
VALUES OF THE RELIABILITY COEFFICIENT AND OF
THE STANDARD DEVIATION

              Reliability coefficient
SD     .95    .90    .85    .80    .75    .70
30     6.7    9.5   11.6   13.4   15.0   16.4
28     6.3    8.9   10.8   12.5   14.0   15.3
26     5.8    8.2   10.1   11.6   13.0   14.2
24     5.4    7.6    9.3   10.7   12.0   13.1
22     4.9    7.0    8.5    9.8   11.0   12.0
20     4.5    6.3    7.7    8.9   10.0   11.0
18     4.0    5.7    7.0    8.0    9.0    9.9
16     3.6    5.1    6.2    7.2    8.0    8.8
14     3.1    4.4    5.4    6.3    7.0    7.7
12     2.7    3.8    4.6    5.4    6.0    6.6
10     2.2    3.2    3.9    4.5    5.0    5.5
 8     1.8    2.5    3.1    3.6    4.0    4.4
 6     1.3    1.9    2.3    2.7    3.0    3.3
 4      .9    1.3    1.5    1.8    2.0    2.2
 2      .4     .6     .8     .9    1.0    1.1

Source: The Psychological Corporation.

ANALYSIS OF VARIANCE. As already stated, the degree of reliability of
a test depends upon the extent to which variations in scores of the group
are attributable to the true (that is, the actual) individual differences in
the trait or the ability being measured and upon the extent of inaccuracies
in measurement. A test is unreliable in proportion to the variations in
results attributable to test inaccuracy, rather than to actual individual
differences among members of the group. In group scores, the estimate
of proportions of variations owing to each of the several sources of both
error and nonerror variation is technically known as "analysis of
variance." 19

In a study of test reliability by this method, the question is: What
factors may be important, and to what extent, in producing the obtained
differences of scores on two applications of the identical test (or of
equivalent forms) to the same group of persons? First, because individuals
differ in any population sample, the analysis should estimate the extent
to which obtained differences in scores are due to true differences in the
functions being measured. Second, if there is some general improvement
of scores on the second test, it would be necessary to estimate the practice
effect. Third, to what extent are differences due to defects in the test
that will produce guessing and other chance responses (defects in
content); or to errors in scoring; or to the testing environment? Because not
all elements contributing to variations in scores can be accounted for,
those that cannot be separately isolated and analyzed are called "residual
factors."

Analysis of variance is not ordinarily used as a method of reporting
reliability of a psychological test. It is, however, an important research
tool.20

19 Variance is defined as the mean of the squared deviations from the mean
score of the group. A measure of variance is an index of the extent to which
individual scores of a group vary from the average score. Variance is the
statistical term for the square of the standard deviation (SD).
Factors Affecting Reliability Estimates 


Range of Ages. If a reliability coefficient is found with a group
that has a relatively small variation of the trait or ability being measured,
the coefficient will be relatively low. If the group has a wider range, the
coefficient will be higher (see Fig. 4.3). Thus, a test having high reliability
for a widely varying group does not necessarily have equal reliability for
a group of persons who are significantly more homogeneous. One of the
reasons for this fact is the nature of the correlation process and the
elements in the correlation formula.

FIG. 4.3. Curve showing increase in test reliability as variability of a
group increases

For illustrative purposes, suppose there is a group that is completely
homogeneous in chronological age. Assume that everyone in the group
is exactly 10 years of age. If they are an adequately representative sample
of all 10-year-olds, the range in test scores will be from extremely low to
extremely high. In this instance, because there is no deviation (or range)
at all in one of the measures (chronological age), the correlation coefficient
for the two variables (test score and CA) will be zero.21 Such an extreme
instance rarely occurs, but it does demonstrate that when there are
possibilities for wide variations in one measure (in this instance, the test
score) and very restricted possibilities in the other (in this instance, the
CA) the coefficient is lowered.

20 Students who want to become familiar with this technique should consult
one of the standard textbooks on statistical methods in psychology.


If the age range were two years instead of one, the coefficient of
correlation would still be low, but not zero, because in general the members
of the older group tend to have higher test scores than do those in the
younger. But since there is a wide range of capacity within each group,
and overlapping of capacity between the two age groups, the coefficient
will be low.

A correlation coefficient reflects the group trends of the measures. As
persons increase in age, mental capacity increases until maximum
development is reached.
But since there are wide differenc
TABLE 4.2

RAW SCORES AND RANKS OF STUDENTS ON
TWO FORMS OF AN ARITHMETIC TEST

              Form X            Form Y
Student    Score   Rank     Score   Rank
A            90      1        88      2
B            87      2        89      1
C            83      3        76      5
D            78      4        77      4
E            72      5        80      3
F            70      6        65      7
G            68      7        64      8
H            65      8        67      6
I            60      9        53     10
J            54     10        57      9
K            51     11        49     11
L            47     12        45     14
M            46     13        48     12
N            43     14        47     13
O            39     15        44     15
P            38     16        42     16
Q            32     17        39     17
R            30     18        34     20
S            29     19        37     18
T            25     20        36     19

Source: Test Service Bulletin, The Psychological Corporation,
May 1952, no. 44.


the importance of the shifts is greatly emphasized. Whereas in the larger
group Student C's change in rank from third to fifth represented only a
10 percent shift (two places out of twenty), his shift of two places in rank
in the smaller top group is a 40 percent change (two places out of five).
When the entire twenty represent the group on which we estimate the
reliability of the arithmetic test, going from third on form X to fifth on
form Y still leaves the student as one of the best in this population. If, on
the other hand, reliability is being estimated only on the group consisting
of the top five students, going from third to fifth means dropping from the
middle to the bottom of this population, a radical change. A coefficient,
if computed for just these five cases, is .50 (rho).22

22 Rho represents the "rank-order" correlation coefficient. It approximates
closely the product-moment coefficient r.

Note that it is not the smaller number of cases which brings about the
lower coefficient. It is the narrower range of talent which is responsible. A
coefficient based on five cases as widespread as the twenty (e.g., Pupils A, E,
J, O, and T, who rank first, fifth, tenth, fifteenth, and twentieth respectively
on form X), would be at least as large as the coefficient based on all twenty
students.23 [rho = +1.00]
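
The effect of restricting the range can be reproduced from the scores in
Table 4.2. The sketch below computes the rank-order coefficient with the usual
formula rho = 1 - 6Σd²/(n(n² - 1)), re-ranking the students within whatever
subgroup is chosen. The function names are hypothetical, and the code assumes
there are no tied scores, which holds for this table.

```python
# Raw scores from Table 4.2 (students A through T).
FORM_X = dict(zip("ABCDEFGHIJKLMNOPQRST",
                  [90, 87, 83, 78, 72, 70, 68, 65, 60, 54,
                   51, 47, 46, 43, 39, 38, 32, 30, 29, 25]))
FORM_Y = dict(zip("ABCDEFGHIJKLMNOPQRST",
                  [88, 89, 76, 77, 80, 65, 64, 67, 53, 57,
                   49, 45, 48, 47, 44, 42, 39, 34, 37, 36]))


def ranks(scores):
    """Map each student to a rank, 1 = highest score (assumes no ties)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {student: place for place, student in enumerate(ordered, start=1)}


def rho(students):
    """Rank-order correlation between the two forms for the chosen students."""
    rank_x = ranks({s: FORM_X[s] for s in students})
    rank_y = ranks({s: FORM_Y[s] for s in students})
    n = len(students)
    sum_d2 = sum((rank_x[s] - rank_y[s]) ** 2 for s in students)
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


print(round(rho(list(FORM_X)), 2))  # all twenty students
print(rho(list("ABCDE")))           # top five only: 0.5
print(rho(list("AEJOT")))           # five widely spread students: 1.0
```

The same five-case sample size gives .50 for the narrow top group and 1.00 for
the widely spread group, which is the point of the passage: the range of
talent, not the number of cases, lowers the coefficient.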


Furthermore, when the variation among testees is narrow, the
correlation between two sets of scores may also be lowered by chance and minor
psychological factors. Since individuals in such a group are closely
clustered (that is, their true differences are small), the changes in scores and
relative positions produced by extraneous factors are more significant
than they would be in a widely divergent group.

This illustration makes clear the fact that reliability coefficients of
a given test may vary as the composition of the tested group changes, even
though the performances of the testees themselves are unchanged. Thus,
reliability data may show that a test discriminates satisfactorily over a
wide range of the trait or capacity measured; but reliability may still
be inadequate where finer and more precise discriminations are
necessary among individuals who vary within a narrow range.

The practical significance of range, and hence of ability, is this: in 
standardizing a test, its author must determine reliability with a group 
that is similar in average level of ability and in variations of scores to 
the group with whom the test is to be used. The examiner should select 


an instrument that, among other things, provides reliability data based 
upon a sampling of persons who resemble closely the group of individuals 
he desires to test. 

The Time Interval Between Testings. When reliability estimates
are based upon scores of two equivalent forms adm
the results are relatively
condition and attitudes,
the physical conditions of th
apart, it is unlikely that
WAYS possible that s Ehe Psychological Corporation, 2, no. 44.
at a single sitting, or in a single day, are most likely to estimate best the
consistency of the instrument itself, they do not indicate stability of per- 
formance over a period of time as well as do coefficients obtained by the 
test-retest method, using a time interval. Conversely, the test-retest method 
is the more likely to underestimate the internal consistency of a test, 
because factors extraneous to it may affect the scores dissimilarly. The 
extent to which the accuracy of a test is underestimated by the test-retest 
method will depend upon the degree to which effective influencing con- 
ditions are inconsistent. If the time interval has been quite long, espe- 
cially in the case of young children—perhaps three months or more—an 
individual's retest results may be influenced by peculiarities of his growth 
tempo, or by other more or less enduring conditions such as emotional 
experiences, which may affect persons of any age. 

The Effects of Practice and Learning. Such effects will depend 
upon the content of the test, the length of the interval, and upon the 
examinee's experiences during the interval. For example, if some months 
have elapsed between two administrations of an educational achievement 
test, different pupils may have had different amounts and qualities of 
instruction during the period. The retest scores would, in part, reflect 
this difference; thus the correlation coefficient would not be solely a re- 
liability coefficient. Or, in the case of a personality test, therapy or coun- 
seling may have modified an individual's attitudes, values, and behavior 
sufficiently to produce significant differences in test-retest results. 

Reliability of Subtests. It has already been explained that, 
other factors being equal, the reliability of a test increases with increase 
in length, although not in direct proportion. This principle applies to 
those scales that consist of several different subtests, each of which utilizes 
a different kind of content. Nearly all group tests are of this kind, as are 
some of the individual scales (the Wechsler, for example). For these in- 
struments, the total-test reliability is higher than that for each of the 
subtests. It is erroneous, therefore, to assume that the reliability coefficient 
for the whole may be applied to a part. For example, the Wechsler Intel- 
ligence Scale for Children yielded a full scale (nine subtests) reliability 
of .92 for a group of 200 children 7 1/2 years of age, the coefficient having
been calculated by means of the split-half technique. Yet, the reliabilities 
of the individual subtests for the same group of children ranged from a 
low of .59 to a high of .84 (21). It is obvious that the total score is a more 
dependable index of the abilities or the traits being measured than is 
any subtest score. 

Consistency of Scorers. This is a factor in evaluating test re- 
liability. Some tests (such as the Stanford-Binet and, in particular, pro- 
jective techniques) are not entirely objective in scoring, because the 
examiner at times finds it necessary to judge the correctness or quality of
responses. For tests such as these, it is necessary to know the extent of 
agreement in scoring among competent psychologists who have scored 
the same sets of responses. Test authors usually report such data in their 
manuals; and in addition, other psychologists will have carried out and 
reported studies on this problem. Lack of agreement among scorers will 
adversely affect the reliability findings. 

Which Method of Estimating Reliability Is Preferable? An 
answer to this question depends upon the problem at hand. Psychologists 
and educators want to know (1) the internal consistency of a test and 
(2) the predictive value of a test when it is subject only to the minor or 
accidental changes in conditions from day to day, rather than to funda- 


Psychologists studying effects on mental operations and organization
produced by psychotherapy, personality difficulties and disorders, serious
emotional experiences, or basic personality changes with time use
standardized tests for their purposes. Obviously, if their findings are to be
meaningful, the testing devices must be reliable (12).
Definition


An index of validity shows the degree to which a test measures what it
purports to measure, when compared with accepted criteria. The
construction and use of a test imply that the instrument has been
evaluated against criteria regarded by experts as the best evidence of the
traits to be measured by the test. Selection of satisfactory validating
criteria and demonstration of an appropriate degree of validity are
fundamental in psychological and educational testing.

The first essential quality of a valid test is that it be highly reliable.
If a reliability coefficient of a test is zero, it will not be correlated with
anything. A test that yields inconsistent results (low reliability) cannot
correlate well with a measure of another variable; in this instance, a
criterion.
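
The dependence of validity upon reliability can be put in numerical form. The following is a minimal sketch, assuming the classical result that a correlation between two measures cannot exceed the square root of the product of their reliabilities. The function name and the reliability values are hypothetical.

```python
import math


def max_validity(test_reliability: float, criterion_reliability: float = 1.0) -> float:
    """Upper limit of a validity coefficient, given the reliabilities of
    the test and of the criterion (classical test theory)."""
    return math.sqrt(test_reliability * criterion_reliability)


# A perfectly reliable criterion is assumed here for simplicity.
for r_xx in (0.0, 0.25, 0.64, 0.90):
    print(r_xx, round(max_validity(r_xx), 2))
```

With a reliability of zero the limit is zero, in agreement with the statement that such a test will not be correlated with anything.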
Types of Validity


Operational and Predictive Validity. The terms "operational" 
and “operationalism” have considerable currency in contemporary psy- 
chological writings. "Operational" simply means that something pertains 
to an operation or a procedure. "Operationalism" is the principle, or 
doctrine, that propositions, concepts, constructs, and theories are given 
their meaning, in the last analysis, by the methods of observation or in- 
vestigation used to arrive at them; that they have no other meaning than 
is yielded by the procedures or operations by which the things or processes 
to which they refer are known. “Prediction” is used here in its ordinary 
meaning: to forecast. 

It is useful to recognize these two general types of validity, although 
they are not mutually exclusive. This is especially true at the present 
time, because much more attention is now concentrated on several forms 
of operational validity that are discussed in the following pages (for ex- 
ample, construct validity and factorial validity) than used to be the case. 

"Operational validity" means that the tasks required by the test are 
adequate for the measurement and evaluation of certain defined psycho- 
logical activities. For example, The Seashore Measures of Musical Talent 
are tests of certain essential auditory aspects of musical talent, but not of 
all aspects of "musical talent," which psychologically involves much more. 
In so far as the Seashore tests differentiate correctly among persons in 
regard to the specified auditory processes, they are operationally valid. 

On the other hand, these measures have predictive validity to the 
extent that they are efficient in forecasting subsequent development of 
various degrees of skill and competence in the several aspects of music. 
Thus, the predictive validity of a test is the extent to which it is efficient 
in forecasting and differentiating behavior or performance in a specified 
area under actual working and living conditions. 

Numerous other examples can be cited to illustrate the difference be- 
tween predictive and operational validity. A pegboard test (placing small 
metal pegs into a perforated board) may measure manual and digital 
dexterity (operational), but it might be only slightly useful in predicting 
mechanical ability. Again, a word and number checking test may be 
quite satisfactory as a measure of perception of details (operational), but
it might have limited value in predicting success as a secretary. A test of 
the four fundamental arithmetical processes might be valid for measuring 
proficiency in these (operational validity), but it might have very little 
value in predicting ability to learn algebra at the ninth-grade level. 

Predictive validity is dependent, at least in part, upon the operational 
validity of the test. The reason is that the psychological operations
required by the test were included because they were found essential for
testing in certain actual situations. Hence, if the psychological opera- 
tions, or the information, or the specific skills are not measured validly, 
predictions of later performance will be adversely affected. 

Most tests must have predictive validity, as should be evident from 
the uses to which they are put. Possible exceptions are tests of educa- 
tional achievement, when they are used solely for measuring specific 
learning during a specific period of time, without regard to future edu- 
cational planning for the individuals examined. It is a fact, however, that 
results of educational achievement tests are most often used not only to 
determine an individual’s level and quality of achievement in a subject 
of study, at a specified time, but also to plan his subsequent education. 

Face Validity. This is a term used to characterize test
materials that appear to measure what the test's author desires to measure.
That is, the test contains items that seem to be related to the variable 
being measured. The content of the test seems to be relevant to its stated 
purpose; and there is no further effort to confirm the assumption ob- 
jectively. This, too, is an operational conception of validity, based upon 
subjective judgment. 

This form of validity, now rejected, has been disparaged since more 
sophisticated procedures have been devised. As a matter of fact, however, 
face validity in the earlier days of test development was the criterion used 
by many competent psychologists as a first step. Validation of their tests
at face value was not as capricious, haphazard, or casual as
some have said. On the contrary, their content was based upon
whatever psychological knowledge and insight could then be utilized.1

Face validity was claimed most often with tests of educational
achievement and of personality, and to a lesser extent with tests of specific
aptitudes.

Tests have been validated at face value when there was urgent need
for them, for example, when psychologists worked under pressure in the
armed forces, or in the early stages of developing tests for use in selecting 
industrial and business personnel. Except in such instances, a claim of 
face validity is not sufficient to warrant the use of a test. 

Tests of personality traits present an especially difficult problem in
validation. Often in the past, and to some extent currently, authors of
personality tests have used face validity. In this category, the sounder
instruments, however, are validated against actual forms of behavior of
samples of individuals and against clinical diagnoses. However, these
validating criteria present difficulties because they are themselves based
to an appreciable degree on the subjective judgments of the specialists
making the diagnoses and the evaluations of behavior. These problems
are discussed in detail in subsequent chapters.

1 ... do not have in mind the incompetents, charlatans, and popularizers ...

Content Validity. As the name indicates, this form of validity 
is estimated by evaluating the relevance of the test items, individually 
and as a whole. Each item should be a sampling of the knowledge or per- 
formance which the test purports to measure. Taken collectively, the 
items should constitute a representative sample of the variable to be 
tested. At the same time, it is essential that the content not be
compounded by introducing irrelevant problems and materials. For example,
a test of spelling ability should not place any weight on rate of writing.
A test of the four fundamental arithmetical processes should not be
“contaminated” with demands upon reading ability. On the other hand,
a test of ability to solve arithmetical problems must include reading
comprehension. 

Content validity is most appropriately applied only to tests of pro- 
ficiency and of educational achievement, although such validity may be 
and should be supplemented by several types of statistical analysis. Valid- 
ity of content, however, should not depend upon the subjective judg- 
ment of only one specialist. It should be based upon careful analyses, by 
several specialists, of instructional objectives and of the actual subject 
matter studied. For example, in constructing a test of American history,
the specialists will examine textbooks and courses of study that they
believe to be representative; they will determine which topics and facts are
most significant, and what their relative weights in the whole test should
be. Representative items will then be devised in collaboration with
someone who is an expert in the writing of test items and in technical
procedures of test construction.2

2 One university, for example, established a test bureau that was given the
responsibility of preparing objective tests in each of several areas of study.
Students were required to pass these tests before they were permitted to take
courses in the upper division of the university. The staff of this bureau
consisted of one or more specialists in each area for which measures were to
be constructed, plus a number of specialists in test construction.

The validating process should not stop there. Statistical analyses should
follow, for the purpose of: (1) determining which items discriminate best
between individuals at the upper and the lower levels of performance;
(2) determining the percent that answers each item correctly; (3) determining
the significance of increases in average scores from one school grade
to the next; (4) determining for each item, or for each division (subtest),
its correlation with educational progress and with general educational
performance (school marks). Thus, content validation rests first upon an
expert analysis of the materials to be sampled (the variable), and second
upon the use of available statistical procedures to refine the original
selection of items (16).
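
The first two of the statistical analyses named above can be sketched briefly. The following is a minimal illustration, assuming items scored 1 (correct) or 0 (incorrect) and assuming that upper and lower groups are formed from total scores; the default of 27 percent for each group is a common convention and is not taken from this text. The function name and the data are hypothetical.

```python
def item_analysis(responses, fraction=0.27):
    """responses: one list of 0/1 item scores per examinee.
    Returns, for each item, (percent answering correctly,
    proportion correct in upper group minus proportion in lower group)."""
    ranked = sorted(responses, key=sum, reverse=True)  # highest totals first
    n_group = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    results = []
    for i in range(len(responses[0])):
        percent = 100 * sum(r[i] for r in responses) / len(responses)
        p_upper = sum(r[i] for r in upper) / n_group
        p_lower = sum(r[i] for r in lower) / n_group
        results.append((round(percent, 1), round(p_upper - p_lower, 2)))
    return results


# Six hypothetical examinees, three items; halves are used because the
# sample is too small for 27 percent groups.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
for number, (percent, discrimination) in enumerate(item_analysis(scores, 0.5), 1):
    print(number, percent, discrimination)
```

In this invented sample the first item separates the upper from the lower group completely, while the third item, although answered correctly by more examinees, does not discriminate at all and would be a candidate for revision.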

Factorial Validity. This method utilizes factor analysis tech- 
niques that are not within the scope of the present discussion. Factor 
analysis theory, however, is discussed in Chapter 7 in connection with 
theories of intelligence. Yet since factorial validation is a method used 
with some tests, students should be familiar with the general nature of 
the concept. 

Most tests of mental ability and of personality sample a composite of 
behavior or of ability, such as verbal knowledge and facility, number 
facility and quantitative reasoning, memory span, and concept formation. 
Factor analysts maintain that these and others, especially when repre- 
sented by a single composite index (such as mental age or intelligence 
quotient), are not "functional unities." Factorial analysts insist that they 
are not measures of a “pure” ability, that is, one type of ability uncom- 
plicated by others. Thus, according to this theory, a test is said to have 
high factorial validity if it is a measure of one functional unity (for 
example, word knowledge) to the exclusion of other elements, as far
as possible. The factorial process aims to identify, by the method of inter- 
correlations and further statistical analysis, a list of functional unities 
(also called “primary mental abilities”) within a test and the weight 
contributed by each of these to total performance on the test. The ulti- 
mate goal is to devise tests, each of which will measure only one func- 
tional unity and be relatively independent of others (that is, show quite 
low intercorrelations). Such pure tests would then be used singly; or they 
might be used as subtests in a comprehensive measuring instrument. 

These functional unities are determined by analyzing the intercorrelations
among a number of tests. For example, ten separate tests might be
administered to a large number of individuals. Each separate test score
would be correlated with every other, yielding a table of correlation
coefficients.3 Inspection of the table and of the tests may reveal
clusters, or functional unities, such as number facility or memory
span. But analysis by inspection alone is not adequate. Statistical
techniques have been developed to analyze the table of correlation
coefficients and in this way identify the common factors that will
account for the obtained coefficients.4

3 ... is scored and rated independently for ...

Since the original number of tests or types of materials will usually 
group themselves into clusters (factors), the number of factors will be 
smaller than the original number of tests. Factor analysis, therefore, is 
intended to reduce the number of variables, or test categories, needed to 
represent an individual's abilities or traits for specified purposes. It is a 
technique that yields clusters of correlations from which one can infer 
the existence of underlying variables. The psychologist must make the 
inference of their existence primarily on the basis of his psychological 
insights into the intellectual or behavioral processes involved in the tests 
included within each cluster of correlations. 
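
The notion of clusters within a table of intercorrelations can be made concrete with a small example. The following is a minimal sketch that only imitates inspection of the table: tests are grouped whenever their intercorrelation reaches an arbitrary level. It is not one of the factor-analytic techniques themselves, and the test names, coefficients, and the level of .50 are all hypothetical.

```python
def clusters(names, r, threshold=0.50):
    """Group tests that are linked, directly or through other tests,
    by intercorrelations of at least `threshold`."""
    unassigned = list(range(len(names)))
    groups = []
    while unassigned:
        stack = [unassigned.pop(0)]
        members = []
        while stack:
            i = stack.pop()
            members.append(i)
            linked = [j for j in unassigned if r[i][j] >= threshold]
            for j in linked:
                unassigned.remove(j)
            stack.extend(linked)
        groups.append(sorted(names[i] for i in members))
    return groups


names = ["vocabulary", "analogies", "arithmetic", "number series", "digit span"]
r = [  # hypothetical table of intercorrelations
    [1.00, 0.72, 0.30, 0.25, 0.20],
    [0.72, 1.00, 0.35, 0.33, 0.22],
    [0.30, 0.35, 1.00, 0.68, 0.28],
    [0.25, 0.33, 0.68, 1.00, 0.31],
    [0.20, 0.22, 0.28, 0.31, 1.00],
]
print(clusters(names, r))
```

Five tests here reduce to three clusters. Naming each cluster (for example, a verbal and a numerical one) remains a psychological inference, as stated above; the arithmetic does not supply it.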

Factorial validity, therefore, is determined by the weights (called
“loadings”) contributed to the total-test scores by each of the derived factors,
as it is also by their relative independence of one another (low
intercorrelations). Having deduced that certain apparently separate mental
operations do, in fact, correlate well and tend to form a cluster, the
tester may then devise more precise and more restricted sets of items for
each of the factors and thus repeat the factorial analysis several times.
He will thereby be able to construct a test in which each part (subtest)
will be relatively distinct from the others, so far as kinds of mental
activity are concerned.

The merit of factorial validity, in any given instance, will depend 
upon the appropriateness of the statistical procedure and upon the tester's 
initial psychological insights in selecting items, and also in the
interpretation of his findings.

If validation stops at factorial validity, operational validity has been 
inferred. If the test has been validated against criteria of later
performance in working situations, after factorial validity has been calculated,
then we also have predictive validation. The principal contribution of 
factorial validation is this: instead of validating the total, undifferen- 
tiated instrument against external criteria, an effort is made to identify 
the component psychological elements and to establish their relative 
independence, and finally to correlate these elements separately against 
external criteria.

Such analysis into psychological unities, or elements, is of value when
individuals are to be selected for specialized work or study and their
performance predicted therein. For example, since mechanical aptitude
is not a simple, unitary skill, it is valuable to be able to identify which
psychological elements have most predictive value for a given type
of work. Mechanical work may involve a high degree of spatial perception
in one situation but not in another; similarly with manual precision
and speed, or comprehension of mechanical principles. Also, higher than
average intelligence is desirable for the practice of both law and
engineering. In the former, word knowledge and verbal concept formation
are the more significant; in the latter, spatial perception and quantitative
reasoning are more significant than word knowledge. Factorial analysis
can assist in identifying the more restricted and immediately relevant
aspects of ability required in a given occupation or activity.

4 An explanation of techniques of factorial analysis can be found in some of
the textbooks on statistical methods in psychology. For a more comprehensive
and detailed presentation, see (12).

Construct Validity. This is one of the newer operational conceptions
of validity; it should be understood, first, in terms of the meaning
of "construct." There are several parts to the dictionary's definition of the 
term: (1) a synthesis or ordering of elements or factors; (2) an object 
of thought which arises by synthesis or ordering of terms; (3) a product of 
the uniting of immaterial elements. The Englishes' dictionary of psychol- 
ogy states that a construct is “a concept formally proposed with definition 
and limits explicitly related to empirical data" (10). If this criterion of
validity is applied in the development of a test, these definitions signify 
that the instrument's validity is to be judged by the extent to which it 
measures or assesses the psychological processes (mental operations) or 
the personality traits as defined and analyzed by the author of the test. 
Construct validity depends upon the degree to which the test items in- 
dividually and collectively sample the range or class of activities or 
traits, as defined by the mental process or the personality trait being 
tested. 

Construct validity differs from face and content validities in that under 
the first of these, each process or trait to be tested must be analyzed and 
made explicit, and each item in the test must belong to the process or 
trait in question. Thus, to devise a test of manual skill, a psychologist
must learn, through analysis of the performances involved, what kinds
of activities constitute manual skill.


Construct validity also differs from the others in so far as this concep- 


been or are being subjected to refinement by means of
factorial analysis or predictive validation in situations where the
characteristic is in operation.




The Lorge-Thorndike intelligence tests include several types of items 
which, over an appreciable period of years, have been accepted as sam- 
pling and measuring certain mental activities, such as interpretation of 
symbols, using words, numbers, and more or less abstract diagrammatical 
forms. If a self-rating personality inventory has been based on construct 
validity, each trait to be assessed has been analyzed (on the basis of actual 
cases) to determine the kinds of activities which exemplify the trait, and 
relevant items have been devised. For instance, in developing an inventory 
to assess the trait ("dimension") security-insecurity, a psychologist must 
answer these questions: What is the definition of the syndrome called 
security? What are the various specific forms of behavior manifested by 
persons who are diagnosed as "insecure" by counselors and clinicians? 
What are the opposites of these as manifested by individuals who are 
regarded as "secure"? Do the items in this inventory relate definitely and 
comprehensively enough to these specific forms of activity? Do the items, 
individually and collectively, differentiate between the markedly secure 
and markedly insecure? Do the test items also differentiate reasonably 
well among the several degrees of the trait, excluding the extreme groups? 

If a test has a satisfactory degree of construct validity, the scores ob- 
tained with it should indicate the status of the testees at the time of 
examination. The psychologist may then interpret the obtained results 
primarily in terms of the mental processes or of the trait as defined and 
sampled, and with reference to the population sample upon whom the
instrument has been standardized. 

For example, if a sound “construct valid” test of mental ability has 
been administered, the findings can answer such questions as the follow- 
ing: How well does this individual now perform when dealing with 
abstract verbal concepts, standardized on a representative sampling of 
pupils in grades x, y, z? How well does this pupil perform on problems 
requiring organization and interpretation of quantitative symbols, stand- 
ardized on a representative sampling of pupils in the same grades? At 
what grade or age level does the pupil perform when dealing with
nonverbal forms requiring analysis and integration?

When a test of mechanical comprehension is being used, if it has
construct validity, it will be properly scaled; it will be sufficiently
comprehensive to serve its specified purpose. Its findings should, therefore, be
able to supply an answer to this question: How well is each testee able to
deal with a given range of problems representing mechanical
comprehension at the specified levels of difficulty and in comparison with the
performance of the standardization population?

In the case of a security-insecurity inventory that is said to have
construct validity, its findings should answer this question: What degree of
this trait is represented by a given range of scores if the norms were
derived from scores based on ratings, by qualified observers, of responses
in relevant situations?

This type of operational validity is employed, also, in devising aptitude 
tests in music and the graphic arts. These tests should be able to answer 
specific questions about an individual's capacities in the several psycho- 
logical aspects on which learning and performance depend. These aspects 
have emerged from analysis of each of these arts by qualified artists
themselves, in conjunction with psychologists who have used these as well 
as their own analyses of experimental tests, as in the case of the Seashore 
Measures of Musical Talent, and the Wing Standardized Tests of Musical 
Intelligence (see Chapter 19).

Tests of intelligence were initially based on certain ideas concerning
which mental operations were most important in this complex mental 
process. These concepts were subjected to experimental trial, to further 
psychological and statistical analysis, and to practical use. As a conse- 
quence, the types of test materials underwent refinement in all instances 
and extension in some respects, but restriction or elimination in others. 
These results will become apparent in later chapters in which the tests 
and their contents are described and discussed. 


Concurrent Validity. This is also one of the newer terms. Originally,
psychologists spoke of validation with other tests, validation with
a proficiency rating, validation with school grades, or, in the case of a
personality test, validation with a recent diagnosis. At present,
psychologists prefer the term “concurrent validity” to indicate the process of
validating the new test by correlating it, or otherwise comparing it for
agreement, with some present source of information. This source of
information might have been obtained shortly before or very shortly after the
new test was given.

New criteria of validity are not involved. For example, in standardizing
a new group test of intelligence, its scores will be correlated with those on
a sound, standardized achievement test, both being given within a short
time of each other. Or, as frequently happens, the scores of the group test
will be correlated with those obtained on the Stanford-Binet scale for a
manageable but representative number of individuals.5

5 "Manageable" means that the number of individuals tested with the
Stanford-Binet is necessarily much smaller than that to whom the group test
is given, since the former is individually administered and time consuming.

Cross Validation. This term refers to the process of validating
a test by using a population sample other than the one on which the
instrument was originally standardized. The reason for using this method is
that at times the original validity data may be spuriously high or low
because of chance factors that produce a higher or lower correlation than
is warranted. As a matter of fact, however, once a test is put to use in a
variety of situations and by many different persons, it is being constantly
cross-validated; and if it does not prove to have high enough value, its use
will be or should be discontinued. It is sounder practice, however, to
cross-validate a test before making it available for general use.
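
The comparison involved in cross validation can be sketched as follows. This is a minimal illustration, assuming that validity is expressed as a product-moment (Pearson) correlation between test scores and a criterion; the two samples and all of the scores are hypothetical.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between paired lists x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test scores and criterion ratings.
original_test, original_criterion = [10, 12, 14, 16, 18], [50, 55, 58, 66, 70]
new_test, new_criterion = [11, 13, 15, 17, 19], [57, 52, 63, 60, 68]

r_original = pearson_r(original_test, original_criterion)
r_new = pearson_r(new_test, new_criterion)
print(round(r_original, 2), round(r_new, 2))
```

In these invented data the coefficient for the second sample is appreciably lower than that for the original sample. A drop of this kind is what cross validation is intended to detect before a test is released for general use; samples this small are used here only to keep the arithmetic visible.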


Validating Criteria


The problems of selecting and utilizing satisfactory validating
criteria vary with the several types of tests.

In constructing tests of intelligence (general ability), a common practice
is to use several of the following criteria: school marks, teachers'
judgments of an individual's abilities, cumulative scholastic averages over a
period of several years, number of school grades completed, chronological
age, known groups, and other well-validated tests. The reasons for using
these criteria are that:

Scholastic records provide evidence of mental ability even though affected
by factors other than intellectual ability.

Teachers are in a position to evaluate individual ability with some validity,
because they observe their pupils over a long period and are able to make
interpupil comparisons.

Cumulative scholastic averages are more valid than marks or estimates of
a single teacher, because they represent combined judgments of performance
over a longer period of time.

On the whole, the more able persons complete more formal education;
they reach higher levels in school and college.

As individuals grow older, their levels of intelligence increase until adult
maximum is reached.

Definitely known groups, such as the gifted, somewhat superior, average,
slow learning, and mentally deficient, should show significantly different
levels of performance on a valid test.

A test should correlate well with another instrument of proved
validity that is intended for the same purpose.

The principal criteria in standardizing tests of specific aptitudes (such
as mechanical, musical) are marks in training courses and the performance
of known groups possessing the aptitude in question. Examples
of known groups would be those working efficiently at the several
levels of a mechanical occupation and those in nonmechanical occupations.
It is highly desirable, of course, to use degree of success of actual
performance in the vocation as an ultimate criterion.




When criteria of actual performance on the job are used, they are 
usually ratings by supervisors, rate of production, and evaluation of the 
quality of the product. In managerial positions and in the professions— 
teaching, law, medicine, engineering—valid and uniform criteria of per- 
formance beyond the learning period are much more difficult, or at times 
impossible, to obtain. In standardizing an aptitude test, therefore, the 
most frequently employed criteria are grades and ratings in the training 
courses; that is, criteria of capacity to learn the given skill or the pro- 
fessional subject matter, since aptitude tests are used largely to select 
individuals to be educated in specified areas. Their use in employee 
selection, however, is not inconsiderable and has been increasing. 

In personnel work, in business and industry, where specialized tests are 
used to select individuals for specific jobs, it is possible, indeed essential, 
to use actual production records or performance ratings as criteria of test 
validity. If, for example, a personnel department wants to know whether 
certain measures will identify the potentially best stenographers, the tests 
might be administered (1) to a group of employees of several quality 
levels to estimate the instrument's differentiating efficiency and (2) to 
newly employed personnel whose performance records, after an adequate 
period, would be correlated and otherwise analyzed against their test 
scores.

Devices used for this purpose are of two kinds. There are, first, the 
aptitude tests that sample the psychological processes, or operations, in- 
volved in the job, such as rate of tapping, digital precision, perception, 
and rate of learning new materials. Performance on such tests would be 
indicative of one's capacity to develop the skills or to acquire the knowl- 
edge needed for the work in question. 

The content of the second kind of test consists of items which are
samples of the work to be performed on the job. These are called
... tests. For example, some clerical jobs require speed and
accuracy in arithmetical computations,6 or knowledge of punctuation
..., or ability to read and interpret graphs. A test of ...
proficiency should include rate and accuracy of copying and ...

Tests of educational achievement are ultimately validated against the
educational criteria already discussed under content validity.

Criteria used in estimating validity of personality tests (inventories and
projectives) were briefly presented in the discussion of face validity. In
addition to these, several other techniques of validation are in use. These
include matching of test results with case histories; comparing
results obtained after therapy with those obtained before; and
temporary experimentally produced changes, when individuals are
subjected to certain predetermined conditions in the laboratory.

6 ... be rapidly disappearing with the increasing use of a variety ...

Comparison of Types of Validation. As already stated, the 
basic purpose of any psychological or educational test is to determine 
each individual's present status, or to predict his future status with regard 
to specified types of mental functioning (including motor skills), per- 
sonality traits, or learning. 

A test's content validity indicates the extent to which it yields an ade- 
quate measure of achievement or performance in certain specified areas; 
as in measuring learning in a school subject, or in testing proficiency in 
the simple arithmetical calculations required in a clerical job. 

A test's construct validity (including factorial validity) will indicate the
psychological operations on the basis of which one's test performance may
be explained and evaluated.

A test's concurrent validity indicates the extent of its agreement with
other present criteria measuring similar psychological operations or traits.

It should be clear that content, construct, and concurrent validities are,
in fact, evaluations of the extent to which the device estimates an
individual's status at the time the test was administered. From the viewpoint
of applied psychology, every test, whatever the type, must ultimately be
shown to have predictive validity. Psychologists in applied fields,
educators, and employers are usually interested in knowing present status as
an indication of what may be expected subsequently. If a test of clerical
proficiency or of arithmetical ability is given to a group, it is for the
purpose of selecting those whose present status indicates satisfactory
promise for future performance on the job. If a test of general intelligence
is administered to school children or to college students, the results will be
used for educational or vocational guidance at that time, or with a view
to the future.

When testing on a large scale is undertaken, as in the armed forces, for 
Purposes of screening on the basis of present status, determination of 
Status is only the first step in eliminating those individuals whose test 
Tatings Show little or no promise. At the same time, the test serves to 
Identify those who show adequate promise in the area tested. . 

If tests of intelligence, specific aptitudes, or personality traits are ad- 
ministered in a clinic, their purpose is not only to determine the person's 
Status at that time, but to decide upon procedures to be followed with that 
dividual in order to handle effectively the problem that brought him 
there, 

A possible exception to the necessity of predictive validity is the
educational achievement test administered at the end of a course of
instruction for the purpose of measuring how much has been learned and for
assigning grades. But this is by no means always the sole purpose served by
achievement tests; for performance on them often provides the basis for
selecting later courses of study. Thus, the predictive validity of a test is its
most important characteristic; for, if high enough, it will indicate present
status as well as being predictive.

Some theoretical psychologists would not place as much emphasis upon
predictive validity. Their position is that when tests are used solely for
research purposes, they are not concerned with prediction, but rather with
the covariation of current variables. Although this is true, the research
studies should still be of value in predicting future situations and findings
under similar conditions.


Methods of Calculating Validity 


Simple Correlation. The most frequently used technique of 
estimating validity is to correlate test scores with each criterion. A co- 
efficient of a particular size cannot be specified as signifying, or not sig- 
nifying, a satisfactory degree of validity. Whenever a coefficient is posi- 
tive and has a small probability (or range) of error, it has some value. In 
some instances, coefficients of only +.25 have proved useful. In most cases, 
however, coefficients should be larger.

There are several ways of demonstrating validity by means of correlations.
Figure 5.1 illustrates the simple correlation method. Test scores,
shown horizontally, were correlated with instructors' ratings, shown ver- 
tically. The number in each cell shows how many individuals received 
the scores of that cell, as indicated on both axes. For example, two per- 
sons who made between 21 and 23 errors on the test were given trainer 
ratings of 12; then going to the bottom of the same column, we find that 
one person who also made between 21 and 23 errors had a trainer rating 
of 2. For this sampling of examinees, the coefficient is .46, which is well 
within the range of validity coefficients most often found for a single 
criterion in situations of this kind. One probable reason why the correla- 
tion is not higher is the subjectivity of ratings given by members of the 
training staff.

Table 5.1 presents another method of evaluating the efficiency of a
correlation coefficient of validity. Assume that prior to the use of a test, 50
percent of the persons selected for a job proved to be satisfactory. These
persons presently on the job constitute the criterion group with whom a
certain test is being validated. In the top row of values are the selection
ratios; that is, the percentage of new testees to be selected; namely, from
5 to 95 percent. If, for example, an employer has 50 applicants for 20
openings, the selection ratio is 40 percent. The smaller the percentage,
the more stringent is the selection. In the column at the left are the
possible validity coefficients. A validity of .00, obviously, means that the test
is useless and that the percentage of successes will be no greater than it
was without testing.


TABLE 5.1


If the validity coefficient is .25 and if the selection ratio is only .05
(only the highest 5 percent on the test are to be selected), then 70 percent
of those chosen are likely to prove satisfactory. It can be said, therefore,
that use of a test having a .25 validity coefficient has increased the
probability of success to 70 percent in a situation where it was 50 percent
before. The remainder of the table indicates the probable value of, and
contribution made by, the test as the coefficient of validity increases.
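
Entries of the kind described for Table 5.1 can be approximated numerically. The sketch below is illustrative only: it assumes that test and criterion scores follow a bivariate normal distribution (the usual assumption behind tables of this type, though the text does not state how Table 5.1 was computed), and the function names and the simple midpoint integration are hypothetical choices. Under those assumptions, a validity of .25, a selection ratio of .05, and a prior success rate of 50 percent should give a figure close to the 70 percent quoted above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal curve (mean 0, SD 1)


def selection_ratio(openings, applicants):
    """Proportion of applicants who can be selected."""
    return openings / applicants


def expected_success_rate(base_rate, sel_ratio, validity, steps=20000):
    """Proportion of selected persons expected to prove satisfactory.

    Assumption (not from the text): test and criterion are bivariate
    normal with correlation `validity` (must be below 1.0).
    `base_rate` is the success rate obtained without the test.
    """
    x_cut = STD_NORMAL.inv_cdf(1.0 - sel_ratio)  # test cut-off, z units
    y_cut = STD_NORMAL.inv_cdf(1.0 - base_rate)  # criterion cut-off, z units
    spread = sqrt(1.0 - validity ** 2)           # criterion SD at a fixed test score
    upper = 8.0                                  # practical upper limit of z
    width = (upper - x_cut) / steps
    total = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width            # midpoint rule
        p_success = 1.0 - STD_NORMAL.cdf((y_cut - validity * x) / spread)
        total += STD_NORMAL.pdf(x) * p_success * width
    return total / sel_ratio


if __name__ == "__main__":
    print(selection_ratio(20, 50))                   # 20 openings, 50 applicants
    print(expected_success_rate(0.50, 0.05, 0.25))   # validity .25, top 5 percent
    print(expected_success_rate(0.50, 0.05, 0.00))   # useless test: base rate only
```

With a validity of .00 the function returns the base rate itself, which matches the statement that a useless test leaves the percentage of successes unchanged.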

TABLE 5.2


Table 5.2 is similar in purpose to Table 5.1, except that the former
provides more detail regarding relations between a person's rating on the
test and his prospects of success (26). Assume that the percentage of
successes is 80, and percentage of failures is 20, on a given job. In that event,
the probabilities are that individuals whose test scores place them in the
highest 10 percent of the group (according to the norms of the test) may
be expected to prove satisfactory in 92 cases out of 100 if the validity
coefficient is as low as .30; and in 99 cases out of 100 if the coefficient is .60.
By means of a table such as this, it is possible to make more accurate
selections and predictions for the group as a whole, once the test's
estimated validity is known. It is to be noted, however, that when validity
coefficients are relatively low (.30 to .40) and percent of failures is also
relatively low (20 to 30 percent), the probabilities are about 50-50, or higher,
that individuals whose test scores rank them in the lowest 20 percent of
the distribution will be retained.
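
The figures quoted from Table 5.2 can be checked in the same spirit. The following sketch again assumes a bivariate normal relation between test and criterion, which is an assumption of this illustration rather than something the text specifies, and the function name is hypothetical. It estimates the expected success rate among examinees whose test scores fall between two percentiles of the norm group.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal curve


def success_rate_in_band(base_rate, lower_pct, upper_pct, validity, steps=20000):
    """Expected success rate for examinees between two test percentiles.

    `lower_pct` and `upper_pct` are proportions (0.90 to 1.00 is the
    highest 10 percent). Assumes bivariate normality; validity below 1.0.
    """
    lo = Z.inv_cdf(lower_pct) if lower_pct > 0 else -8.0
    hi = Z.inv_cdf(upper_pct) if upper_pct < 1 else 8.0
    y_cut = Z.inv_cdf(1.0 - base_rate)   # criterion cut-off, z units
    spread = sqrt(1.0 - validity ** 2)
    width = (hi - lo) / steps
    successes = 0.0
    band_mass = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width       # midpoint rule
        weight = Z.pdf(x) * width        # share of examinees near x
        band_mass += weight
        successes += weight * (1.0 - Z.cdf((y_cut - validity * x) / spread))
    return successes / band_mass


if __name__ == "__main__":
    # 80 percent successes on the job, highest 10 percent on the test
    print(success_rate_in_band(0.80, 0.90, 1.00, 0.30))
    print(success_rate_in_band(0.80, 0.90, 1.00, 0.60))
    # lowest 20 percent on the test, validity .30
    print(success_rate_in_band(0.80, 0.00, 0.20, 0.30))
```

The last call illustrates the caution stated above: with a low validity and a low failure rate, even the lowest fifth on the test still has a better than even chance of proving satisfactory.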

One of the principal uses of psychological tests in schools is to provide 
objective data for use in planning an individual's subsequent education 
and in predicting probable performance. Table 5.3 presents data
obtained for this purpose.


TABLE 5.3


CORRELATION BETWEEN TEST SCORES AND FOUR-YEAR GRADE AVERAGES

Inspection of these coefficients shows that, with this group, the A.C.E.
Psychological Examination has no predictive value for pupils enrolled in 
the practical arts curriculum and low predictive value for those in the 
general curriculum; but it has moderate predictive value for individuals 
in college preparatory courses. It is important to note, also, that the co- 
efficients are appreciably higher for the group when treated as a whole. 
These data also demonstrate the importance of specifying the kinds of 
groups for whom a particular test has a certain degree of validity, rather 
than generalizing about it. 

The data in Table 5.3 indicate that although the validity coefficients of 
the D.A.T. are appreciably higher than those of the A.C.E. test, they, too, 
are most significant for the college preparatory group. The D.A.T. co- 
efficients also show that the subtests are not equally predictive. 

Biserial Correlation. This statistic is used when one of the
measures is given in terms of only two categories: for example, "pass" or
"fail," "satisfactory" or "unsatisfactory." The second measure, however, is
given in terms of variable scores. In Table 5.4 the four groupings on the
basis of supervisors' ratings (below average, average, above average,
excellent) have been reclassified into two categories (Groups 1 and 2) that
have been correlated with stenographic proficiency test scores. The
biserial coefficient of .60 indicates that the proficiency test has considerable
value in identifying stenographers who will function at satisfactory or
highly satisfactory levels.


TABLE 5.4


BISERIAL CORRELATION BETWEEN SCORES OF 52 EMPLOYED STENOGRAPHERS
ON THE SEASHORE-BENNETT STENOGRAPHIC PROFICIENCY TEST AND THEIR
SUPERVISORS' RATINGS ON STENOGRAPHIC ABILITY. $r_{bis} = .60$.
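
As an illustration of how a biserial coefficient is obtained, the sketch below applies the standard textbook formula, r_bis = ((M_upper - M_lower) / SD) * (pq / y), where p and q are the proportions in the two categories and y is the height of the normal curve at the point dividing p from q. The function names and the ten scores are hypothetical; they are not the stenographers' data of Table 5.4, which are not reproduced here.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def _split(scores, in_upper_group):
    """Separate the scores of the two rating categories."""
    upper = [s for s, g in zip(scores, in_upper_group) if g]
    lower = [s for s, g in zip(scores, in_upper_group) if not g]
    return upper, lower


def biserial(scores, in_upper_group):
    """Biserial r between continuous scores and a two-category rating."""
    upper, lower = _split(scores, in_upper_group)
    p = len(upper) / len(scores)
    q = 1.0 - p
    z = NormalDist()
    ordinate = z.pdf(z.inv_cdf(p))  # height of normal curve at the p/q split
    return (mean(upper) - mean(lower)) / pstdev(scores) * (p * q) / ordinate


def point_biserial(scores, in_upper_group):
    """Point-biserial r, shown for comparison (treats the split as real)."""
    upper, lower = _split(scores, in_upper_group)
    p = len(upper) / len(scores)
    return (mean(upper) - mean(lower)) / pstdev(scores) * sqrt(p * (1.0 - p))


if __name__ == "__main__":
    # Hypothetical test scores and ratings (True = rated satisfactory).
    test_scores = [18, 17, 16, 15, 15, 14, 13, 12, 11, 9]
    satisfactory = [True, True, True, True, False, True, False, False, False, False]
    print(biserial(test_scores, satisfactory))
    print(point_biserial(test_scores, satisfactory))
```

The biserial value is always larger in absolute size than the point-biserial value for the same data, because it treats the two categories as a coarse grouping of an underlying continuous, normally distributed rating.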

Tetrachoric Correlation. This index is found when a coarse
classification of two measures is adequate for the purpose at hand. When it
is used, the ratings in each measure are grouped into only two classes,
providing a fourfold table. The data in Table 5.4 have been so reclassified
in Table 5.5, yielding a tetrachoric coefficient of +.60.


TABLE 5.5


FOURFOLD TABLE FOR COMPUTATION OF TETRACHORIC
CORRELATION COEFFICIENT

$r_{tet} = .60$

                    Ratings 3-5       Ratings 6-8
                    (Rated Low)       (Rated High)
High on Test         6 (11.5%)        19 (36.5%)
Low on Test         17 (32.7%)        10 (19.3%)

Source: The Psychological Corporation.
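
A rough check on a fourfold table can be made with the cosine-pi approximation to the tetrachoric coefficient. This is a common shortcut and is offered here only as a sketch; it is not necessarily the procedure by which the published value was obtained. The cell counts below are those read from Table 5.5 (the first count is inferred from its percentage of the 52 cases), and the function name is hypothetical.

```python
from math import cos, pi, sqrt


def tetrachoric_cos_pi(high_high, high_low, low_high, low_low):
    """Cosine-pi approximation to the tetrachoric correlation.

    Arguments are cell counts: (test level, rating level).
    """
    agree = high_high * low_low      # cells where test and rating agree
    disagree = high_low * low_high   # cells where they disagree
    if disagree == 0:
        return 1.0                   # perfect agreement
    return cos(pi / (1.0 + sqrt(agree / disagree)))


if __name__ == "__main__":
    # High on test: 19 rated high, 6 rated low.
    # Low on test: 10 rated high, 17 rated low.
    print(tetrachoric_cos_pi(high_high=19, high_low=6, low_high=10, low_low=17))
```

For these counts the approximation comes out at roughly .58, close to the reported coefficient of .60; small discrepancies of this size are to be expected from an approximation.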


Whether one uses the finer classifications necessary for calculating the
product-moment coefficient (the "simple" correlation), or the coarse
groupings shown in biserial and tetrachoric calculations, will depend upon the
nature of the data available and the purpose for which validation is to be
used.

Since the tetrachoric correlation coefficient is much less accurate than
the product-moment r, the former should be used only when approximations,
not too precise, are satisfactory, or when the coefficient is computed
from a very large number of cases, or where the original data are
inherently dichotomous. The tetrachoric correlation is sometimes used
because, with the aid of tables or nomographs, it is easier to compute than
the product-moment coefficient.
Multiple Correlation.⁹ This method is used when two or more
measures are to serve as predictors. The scores of the measures are
statistically combined and correlated with a third measure to yield a single
coefficient of multiple correlation. Whereas the simple product-moment
coefficient indicates the degree of relationship (or covariation) between
paired scores of two sets of measures, the multiple-correlation coefficient
shows the relationship between one set of scores (for example, college
grades or proficiency ratings) and the composite of two or more sets of
other measures (for instance, an intelligence test and college entrance
examination grades). The multiple-correlation coefficient provides an
index that represents the best combination of two or more measures for the
purpose of predicting scores of a third variable. This coefficient
indicates how well a certain composite of predictors (for example, average
of high school grades and score on a scholastic aptitude test) will
correlate with a criterion (for example, college grade averages). Although
each of two tests, considered separately, might have a low or a moderate
correlation with a criterion, the two scores will generally correlate higher,
or quite significantly, with the criterion, when treated as a composite. This
is the case because the two tests in combination have more elements, or
factors, in common with the criterion than does either test in itself. The
following data provide an illustration:¹⁰

Variable 1 is college grade averages.
Variable 2 is scores on a scholastic aptitude test.
Variable 3 is high school averages.

$r_{12} = .65$
$r_{13} = .52$
$r_{23} = .48$

Then $R_{1(23)} = .69$

In this instance, when variables 2 and 3 are treated as a composite, the
multiple-correlation coefficient has risen only slightly because variables 1
and 2 are already quite significantly correlated, and variables 2 and 3 are
markedly correlated. R would have been higher if $r_{23}$ had been
significantly lower than it is.

⁹ The statistical procedures involved in calculating the coefficient of
multiple correlation and the logic of the method are not within the scope of
the present discussion. Students who want to learn the procedure and the
logic should consult a standard textbook on statistics.

¹⁰ The symbol for the coefficient of multiple correlation is R.

In an instance such as the foregoing, some psychologists and educators
would conclude that there is really no justification for considering both
predictors, since the scholastic aptitude test itself correlates with the
criterion about as well as the composite, as shown by the R of .69. This is
true statistically because the increment in the coefficient may not be
significant. This is a conclusion that could be accepted if we were
interested solely in group trends and group probabilities. But since an r of
.65 and an R of .69 are both far removed from unity, there will be
individual exceptions to any conclusions drawn from them; so that, if we are
concerned with individuals in a practical situation, we will utilize as many
sources of information as may be available that contribute information
about the individuals tested.
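
The multiple-correlation illustration above (correlations of .65 and .52 between the criterion and the two predictors, and .48 between the predictors, read here from the printed data) can be verified with the standard formula for two predictors, R² = (r12² + r13² - 2·r12·r13·r23) / (1 - r23²). The sketch below is a minimal check; the function name is hypothetical.

```python
from math import sqrt


def multiple_r(r12, r13, r23):
    """Multiple correlation of variable 1 with the composite of 2 and 3."""
    r_squared = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return sqrt(r_squared)


if __name__ == "__main__":
    # Values from the illustration: college grades (1), aptitude test (2),
    # high school averages (3).
    print(round(multiple_r(0.65, 0.52, 0.48), 2))
    # Same validities, but predictors assumed uncorrelated (hypothetical).
    print(round(multiple_r(0.65, 0.52, 0.00), 2))
```

The second call shows why R gains so little in the illustration: with the same two validities but uncorrelated predictors, the composite would correlate much more highly with the criterion.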

Expectancy Tables. These provide a relatively simple, straight- 
forward, and valuable method of estimating predictive efficiency of a test. 
Estimates are based upon the calculated probabilities that an individual 
who has a given test score will achieve a specified score or rating in the 
performance being predicted. We might ask, as examples, the following 
questions: What are the probabilities that a prospective college student
scoring in the highest decile group on a scholastic aptitude (intelligence)
test will remain in college a given number of terms? What are the prob- 
abilities that a child with an IQ of 80-85 will be able successfully to com- 
plete the work of the eighth grade? What are the probabilities that a 
candidate getting an average score on a stenographic proficiency test will 
achieve a rating of "excellent" or "above average" on the job? Appro- 
priate expectancy tables are intended to answer similar questions. 

Table 5.6 is an illustration in point. It presents part of a larger table 
representing all ten decile groups.


TABLE 5.6


RANK ON A SCHOLASTIC APTITUDE TEST
AND SEMESTERS COMPLETED
(IN PERCENTS)

[Table body not recoverable; rows are decile ranks, columns are terms completed.]

a Decile rank X is the highest; I is the lowest.

To take two items from Table 5.6, it can be said that the probability is
that 88 in 100 of the students in the highest decile group on the scholastic
aptitude test will complete their academic course, whereas only 48 in
100 of the lowest decile group will do so.

Table 5.7 illustrates the use of expectancy data in personnel selection.
Inspection of this table shows that it may be used to indicate what per- 
centage of individuals obtaining each of the several ratings on actually 
demonstrated stenographic ability may be found at each of the several 
levels on the proficiency test. It is also possible to calculate the per- 
centages by rows (instead of columns) so as to indicate the converse: the 
frequencies of the several ability ratings within each of the score intervals 
of the proficiency test. 


TABLE 5.7


EXPECTANCY TABLE SHOWING THE NUMBER AND PERCENT OF STENOGRAPHERS
OF VARIOUS RATED ABILITIES WHO CAME FROM SPECIFIED SCORE GROUPS ON
THE S-B STENOGRAPHIC PROFICIENCY TEST

(N = 52, mean score = 15.4, S.D. = 2.9, r = .61;
score is average per five letters.)

          Number in each score group       Percent in each score group
          receiving each rating on         receiving each rating on
          stenographic ability             stenographic ability

Test      Below   Aver-   Above   Excel-   Below   Aver-   Above   Excel-
scores    aver-   age     aver-   lent     aver-   age     aver-   lent
          age             age              age             age

18-19              4       6       7                17      40      64
16-17              2       2       4                 9      13      36
14-15             10       5                        44      33
12-13              4       1                        17       7
10-11      2       2       1               67        9       7
8-9        1       1                       33        4

Total      3      23      15      11      100      100     100     100

Source: The Psychological Corporation.
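
The percentages in an expectancy table of this kind are simple column or row proportions of the cell counts. The sketch below uses the counts as read from Table 5.7 (empty cells entered as zero; one count in the 10-11 row is inferred from its printed percentage) and computes both arrangements mentioned in the text: percentages by columns (by rating) and the converse, percentages by rows (by score group). Function and variable names are hypothetical, and ordinary rounding may differ by a point from the printed table, where a column has been adjusted to total 100.

```python
SCORE_GROUPS = ["18-19", "16-17", "14-15", "12-13", "10-11", "8-9"]
RATINGS = ["below average", "average", "above average", "excellent"]

# Rows follow SCORE_GROUPS; columns follow RATINGS.
COUNTS = [
    [0, 4, 6, 7],
    [0, 2, 2, 4],
    [0, 10, 5, 0],
    [0, 4, 1, 0],
    [2, 2, 1, 0],
    [1, 1, 0, 0],
]


def column_percents(counts):
    """Percent of each rating group that came from each score group."""
    totals = [sum(column) for column in zip(*counts)]
    return [
        [round(100 * n / t) if t else 0 for n, t in zip(row, totals)]
        for row in counts
    ]


def row_percents(counts):
    """Percent of each score group that received each rating."""
    return [
        [round(100 * n / sum(row)) if sum(row) else 0 for n in row]
        for row in counts
    ]


if __name__ == "__main__":
    for group, by_col, by_row in zip(
        SCORE_GROUPS, column_percents(COUNTS), row_percents(COUNTS)
    ):
        print(group, "by rating:", by_col, "by score group:", by_row)
```

Read by rows, for example, the highest score group (18-19) contains no stenographers rated below average, which is the form of statement most useful when a decision must be made about an applicant whose test score is already known.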


Comparison of Tables 5.6 and 5.7 demonstrates that expectancy tables
need not be uniform. The form and arrangement of data will depend
upon the particular probabilities one desires to determine. But all
expectancy tables for tests have this in common: they provide estimates
of the probabilities that a certain level or quality of performance may
be expected if the test score is known; that is, they express the test's
validity in terms of probabilities in place of or, more often, in addition
to a correlation coefficient.

Figure 5.2, a bar diagram, shows still another method of representing
expectancy. Although three categories for elimination are shown, it is
possible to estimate the percentages of elimination at each test level and,
thus, to estimate the predictive efficiency of the battery of tests used in
selecting candidates for pilot training. The term "stanine score" means a
unit consisting of one-ninth of the total range of the standard scores of
a test.¹¹ Thus, the percentage eliminated from among candidates whose
scores placed them in the highest one-ninth of the scores was only about
5 (about 4 percent for flying deficiency), whereas nearly 80 percent were
eliminated from among those in the lowest (approximately 70 percent for
flying deficiency).


[Figure 5.2: bar diagram of percent eliminated (0 to 100) at each pilot stanine, with separate bars for elimination for physical or administrative reasons, for fear or at own request, and for flying deficiency.]

FIG. 5.2. Percentage of candidates eliminated from primary pilot training,
classified according to stanine scores on selection battery. Reproduced from
Psychological activities in the training command, Army Air Forces, by the
Staff, Psychological Section, Fort Worth, Texas. Psychological Bulletin.
Washington, D.C.: American Psychological Association, Inc., 1945.


¹¹ "Stanine" is a condensation of "standard nine." The total range of scores
is divided into nine equal units; the mean stanine is 5. The method was
developed and the term was coined by psychologists in the United States Air
Force during World War II.

Cut-Off Score. This is a variation on the expectancy method. It
is a test score used as a point of demarcation between examinees who will
be accepted and those who will be rejected. For example, Table 5.8 shows
several values from the Cornell Index (a personality inventory) that might
be taken as cut-off scores.


TABLE 5.8


PERCENT OF PSYCHIATRIC REJECTS* AND ACCEPTS*
IDENTIFIED BY THE CORNELL INDEX

Cut-off level    400 rejects    600 accepts

      7              86             28
     13              74             13
     23              50              4

* Based upon psychiatric interviews.
Source: Manual, The Psychological Corporation.


A low score on this inventory signifies fewer personality difficulties;
hence it is more desirable. The table reads: "If a cut-off score of 7 on the
Cornell Index were used with this group of 1000 persons, 86 percent of
those who were rejected after the interview would have been rejected also
by the index; but 28 percent of those accepted after the interview also
would have been rejected by the index." The other percentages are read
in the same way. Since higher scores in this instance are the less
desirable, taking 7 as the cut-off level means that a score of 7, or
lower, would be necessary for acceptance. If 23 were the cut-off, then
anyone with that score, or lower, would be acceptable. In this instance, the
cut-off score becomes less selective as it increases.
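
The percentages reported for the Cornell Index can be turned back into numbers of persons to show the trade-off at each cut-off level. The sketch below uses only the figures given in Table 5.8 (400 rejects, 600 accepts, and the three pairs of percentages); the function name and the "agreement" summary (the proportion of all 1000 persons on whom index and interview agree) are illustrative additions, not part of the source.

```python
N_REJECTS = 400   # rejected after psychiatric interview
N_ACCEPTS = 600   # accepted after psychiatric interview

# cut-off level -> (percent of rejects flagged, percent of accepts flagged)
TABLE_5_8 = {7: (86, 28), 13: (74, 13), 23: (50, 4)}


def screening_counts(cutoff):
    """Return (rejects caught, accepts wrongly flagged, overall agreement)."""
    pct_rejects, pct_accepts = TABLE_5_8[cutoff]
    caught = round(N_REJECTS * pct_rejects / 100)
    wrongly_flagged = round(N_ACCEPTS * pct_accepts / 100)
    agreed = caught + (N_ACCEPTS - wrongly_flagged)
    return caught, wrongly_flagged, agreed / (N_REJECTS + N_ACCEPTS)


if __name__ == "__main__":
    for level in sorted(TABLE_5_8):
        caught, wrongly_flagged, agreement = screening_counts(level)
        print(level, caught, wrongly_flagged, round(agreement, 3))
```

At a cut-off of 7 the index would flag 344 of the 400 rejects but also 168 of the 600 accepts; at 23 it would flag only 24 accepts but miss half of the rejects. Which balance is preferable depends on the relative cost of the two kinds of error.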

If we are using a test on which larger scores signify higher and more 
desirable levels of the ability or trait being evaluated, then the cut-off 
score becomes the more selective as it is increased. Table 5.9 is a case 
in point. 


TABLE 5.9


PERCENT OF SUPERIOR AND INFERIOR WORKERS IDENTIFIED BY
A PROFICIENCY TEST

                   Superior workers         Inferior workers
Cut-off score     Accepted   Rejected      Accepted   Rejected

     20             100%        0%            80%        20%
     25              90        10             60         40
     30              80        20             40         60


The example shown in Table 5.9 is interpreted thus: If 20 were set as
the minimum acceptable score, then all examinees who proved to be
superior workers would have been employed; but so would 80 percent of
those who proved to be inferior workers.

It is clear that cut-off scores are especially useful in instances where
many more candidates are available than there are places to be filled,
so that the cut-off level may be made highly selective, and where one is
not concerned primarily with the individual candidates as such, but
rather with the places, jobs, or niches to be filled. The purpose in using
cut-off levels is to identify a maximum number of potentially superior
or desirable persons and, at the same time, to eliminate a maximum
number of inferior or undesirable individuals. Since no test has perfect
validity, and since the true potential of some persons may not be
revealed by a single test, screening by means of cut-offs will not be
perfect. Some desirable persons will be rejected, whereas some undesirables
will be selected. Yet, cut-offs and other methods have a considerable
advantage over subjective procedures previously employed; for they provide
data for estimating with greatly increased objectivity and accuracy the
chances of identifying the persons with the desired abilities or traits.

Differential Predictors. Some tests are adversely criticized, at
times, because they do not predict differentially; that is, their predictive
validity (especially in the form of correlation coefficients) is much the
same for two or more types of criteria. For example, it may happen that
scores on test A correlate +.55 with grades in college courses in the social
sciences and the humanities, while the correlation is +.45 with grades in
the physical and biological sciences. It would be argued by some,
therefore, that the test is not useful as a selective instrument because the
correlations are too close; it does not, they urge, differentiate. Or if test B
correlates at +.45 with performance in one job category and +.40 with
another, the same argument would be advanced.

This argument is misleading. While it is true that these instruments
have not differentiated between the criteria in each instance, each test
can still be useful as one source of evidence in making selections of
students or workers. The adverse criticism fails to take into account the fact
that human capacities, abilities, and skills (and to some extent, the
personality traits being rated by means of inventories and projectives)
are positively interrelated, and that psychological decisions and
judgments should not be based upon scores alone. Study of "personality
profiles" will demonstrate this point, as will the following illustration. In one
university, these correlations were found:

r for grades in liberal arts college courses (freshman year) with general
intelligence test score = .48

r for grades in engineering school courses (freshman year) with the same
test of general intelligence = .41

Yet, of the 1400 students in liberal arts who were tested, nearly 90 percent
of those in the highest decile group, and nearly 80 percent in the second
highest, graduated, while in the lowest and next to the lowest decile
groups, only 48 percent and 56 percent, respectively, graduated (after
serious scholastic difficulty). Thus, while the two coefficients did not
differentiate well between the group in liberal arts and that in engineering,
the test was valuable in helping to estimate the probabilities of scholastic
survival of individuals scoring at each of the several levels on the
intelligence test.

Adverse critics also disregard the fact that achievement in education or 
on a job is a complex matter, involving not only intellectual, sensory, and 
motor capacities, but also motivational and personality factors. The view 
here expressed does not minimize the desirability of the efforts made to 
construct differential aptitude tests, which are evaluated in Chapter 17. 
The psychologists’ task would be considerably simplified if human capac- 
ities and abilities were highly specific and specialized; but that does not 
appear to be the case. 

Other Methods. In later chapters it will become apparent that 
still other methods, in addition to those already explained, are used to 
estimate validity. Among these are the percentage who are successful, on 
a test or on individual items, in adjacent age groups and in groups of 
known ability (inferior, average, superior); the statistical significance of 
increases in scores from age to age; and closeness of the distribution of 
scores to the normal frequency curve. Also, in validating personality tests, 
the extent of agreement, among specialists, in scoring and in the inter- 
pretation of results is an accepted criterion. 

There are instances, too, when very low correlations may be regarded as 
evidence of a test's validity. For example, if one constructs a test of me- 
chanical aptitude, based upon the hypothesis that this is a special apti- 
tude and, as such, is relatively independent of the mental operations 
measured as general intelligence, then in constructing the test of me- 
chanical aptitude its author should, among other things, aim to devise an 
instrument that has a low or negligible correlation with tests of general 
intelligence. 

Low correlations, or other evidence showing absence of agreement,
obtained by means of a test, may be indicative of the test's validity in
assessing personality traits; as, for instance, when a person has undergone
successful psychotherapy and has been tested before and after therapy. In this
case, appreciable changes in certain traits would be expected and should
be shown by the test. On the other hand, if an individual's personality
difficulties or disorders are known to have increased over a period of time,
the test should, if valid, reflect these changes.


Item Analysis 


With very few exceptions, psychological tests (other than some 
projective techniques) are made up of a large number of items. The
score on each item is added to the scores of the other items to obtain a
subtest score or a total score, either or both of which are used in calculating
reliability and validity. Ultimately, however, the quality and merit of a 
test depend upon the individual items of which it is composed. It is there- 
fore necessary, in best practice, to analyze each item in the standardization 
process in order to retain only those that suit the purposes and rationale 
of the device being constructed. Item analysis is thus an integral part of 
both the reliability and the validity of a test. 

In evaluating items, two major aspects are considered: the level of dif- 
ficulty of each and the discriminative value of each. 

Difficulty Level. This aspect is determined by the percentage
of individuals able to pass each item. In practice, if an item is to
distinguish among individuals, it should not be so easy that all persons can
pass it; nor should it be so difficult that none are able to pass it.¹² It can
be demonstrated statistically that an item passed by 50 percent of a group
discriminates between more pairs of persons than does an item passed by a
smaller or larger percentage. For example, if an item is attempted by 100
individuals and passed by only 10, and if the testees are taken by pairs,
there are 900 (10 × 90) combinations in which that item can discriminate
between paired members of that group. If the item is passed by 50 in the
group, then the number of possible discriminations between paired
individuals is 2500 (50 × 50), this being the largest number possible, as the
multiplication of any other proportions will show.¹³ Obviously, not all
items in a scale are, or should be, so easy as to be passed by 50 percent of
the group. Some are included that are passed by large percentages and
some by small percentages, with many degrees between the extremes.
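
The pair counts in the preceding paragraph follow from multiplying the number who pass an item by the number who fail it. A minimal sketch (the function name is hypothetical) confirms the two figures given and shows that the product is largest when half the group passes.

```python
def discriminable_pairs(n_examinees, n_passing):
    """Number of pairs (one passer, one failer) the item can tell apart."""
    return n_passing * (n_examinees - n_passing)


if __name__ == "__main__":
    print(discriminable_pairs(100, 10))   # item passed by 10 of 100
    print(discriminable_pairs(100, 50))   # item passed by 50 of 100
    # Number passing that yields the most discriminations in a group of 100.
    print(max(range(101), key=lambda k: discriminable_pairs(100, k)))
```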

There is no formula for determining the exact distribution of item
difficulties. A common practice is to select some items whose difficulty
is at, or close to, the 50 percent level, and other items with a wide range
of degree of difficulty, in terms of percent passing. If all items selected
for inclusion in a test were at the 50 percent level of difficulty, the test
would, theoretically, simply divide the testees into two groups: those above
this predetermined dividing point and those below it. Such items would
not differentiate among the individuals in the group above the 50 percent
level, nor among those below it. Hence, for maximum differentiating
efficiency, a test must contain items at various levels of difficulty as
represented by percentages passing. The final consideration will be the
inclusion of items of such a range of difficulty as to yield the highest predictive
index when compared with the criterion, allowing for the levels of the
ability or trait to be measured and the degree of differentiation to be
achieved.

¹² Theoretically, it would be desirable that the test be so scaled that there
is at least one item that can be passed by all for whom the test is intended.
For zero scores on a particular test do not necessarily mean absolute zero
capacity in the function being tested; nor will all zero scores necessarily
signify the same status. Conversely, it would be desirable that a test be
scaled upward to a level where no one for whom the test is intended is able
to pass the highest item. This aspect would require, of course, that the test
be constructed by a person superior to any of the intended testees.

¹³ It is not to be assumed that "50 percent passing" is necessarily the best
criterion in placing an item in an age scale (like the Binet), as will be seen
later.

In terms of difficulty, the discriminative value of a test item is the
degree to which performance on it satisfactorily differentiates among
individuals who vary in regard to the characteristic being measured.
However, differences in percentages of individuals passing items do not indicate
differences in amounts of the characteristic. These percentages show only
an order of item difficulty.¹⁴

Difficulty levels can be given also in terms of the standard deviation of
the normal curve. Thus, if 84 percent of testees pass an item, it has a
rank of -1 SD (one standard deviation below the mean). If an item is
passed by 16 percent, its rank is +1 SD; if passed by 69 percent or by 31
percent, the ranks would be -0.5 SD or +0.5 SD, respectively.¹⁵ Some
psychologists prefer this index because the standard deviation is a
property of the normal curve; but basically, like percentage passing, it is
an index of relative rank.
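
The conversion from percent passing to a standard-deviation rank uses the inverse of the cumulative normal curve: an item passed by a given percentage is placed at the point on the curve above which that percentage of the distribution lies. The sketch below is illustrative (the function name is hypothetical) and reproduces the four values cited, to one decimal place.

```python
from statistics import NormalDist


def difficulty_in_sd(percent_passing):
    """Item difficulty in SD units; easier items get negative values."""
    return -NormalDist().inv_cdf(percent_passing / 100)


if __name__ == "__main__":
    for pct in (84, 69, 50, 31, 16):
        print(pct, round(difficulty_in_sd(pct), 1))
```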

Determination of difficulty levels is significant not only for a test's dis- 
criminative value; it is essential also if two equivalent forms of the instru- 
ment are to be devised, from the points of view of both degree of dif- 
ficulty of each item and over-all range of difficulty. 

In constructing a speed test, it is essential, obviously, that all items be 
of uniform difficulty or nearly so. Also, if a test is to serve only as a
screening device to divide testees into two groups (pass-fail,
satisfactory-unsatisfactory), it is necessary that a predetermined level of difficulty be selected,
that items be of increasing difficulty up to that level, and that a number of 
items be concentrated at that level. If further differentiation is desired, it 
will be necessary, of course, to include increasingly difficult items. 

Discriminative Value: Correlation. Validity of items may be
estimated by finding the biserial correlation coefficient of each item with
the score of the subtest of which it is a part (such as arithmetical
problems, similarities) to determine whether or not performance on it is
consistent with performance on the subtest as a whole. For this purpose,
each item is scored as “pass” or “fail,” or “plus” or “minus.” This
procedure assumes that all items in the subtest are expected to be
relatively homogeneous in regard to the psychological operations or traits
they are intended to measure. Or, otherwise stated, success or failure on
each item within the subtest is regarded as indicative, to some degree, of
a specified kind of trait or mental operation, or of a similar combination
of characteristics.

14 Interpretations of scores on psychological tests are discussed in
Chapter 6.

15 These percentages are based upon the standard deviation of a normal
distribution. Tables showing percentages of the distribution included
within various fractions of the standard deviation are found in most
textbooks on statistics.

Successes and failures on each item may also be correlated with the total 
scores for the whole test. When this is done and an appreciable positive 
correlation is wanted, the assumption is that the items in all the subtests 
are expected to be relatively homogeneous in regard to characteristics 
measured. If the purpose of the scale is to measure relatively independent 
operations or traits by means of each of the subtests, neither the subtest 
scores nor the individual items within each subtest are expected to show a 
marked correlation with total-test scores. 

The biserial correlation of performance on each item with its subtest 
scores and with total-test scores is found in order to identify, at each level 
of difficulty, items that will contribute most to the validity of the test as a 
whole. This is the process of establishing internal validity.16
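The item-subtest correlation described above can be sketched as follows. This is a minimal illustration, assuming the common textbook form of the biserial coefficient, r = ((Mp - Mq) / SD) * (pq / y), where y is the ordinate of the unit normal curve at the point dividing the proportions p and q; the function name and the sample data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial(item_passed, scores):
    """Biserial correlation of one pass/fail item with subtest scores.

    item_passed: list of booleans, one per testee.
    scores: list of subtest (or total-test) scores, same order.
    (Hypothetical helper; assumes the dichotomy reflects an
    underlying normally distributed trait.)
    """
    passed = [s for ok, s in zip(item_passed, scores) if ok]
    failed = [s for ok, s in zip(item_passed, scores) if not ok]
    if not passed or not failed:
        raise ValueError("item must be passed by some testees and failed by others")
    p = len(passed) / len(scores)
    q = 1 - p
    sd = pstdev(scores)  # standard deviation of all scores
    unit = NormalDist()
    y = unit.pdf(unit.inv_cdf(p))  # ordinate at the p/q dividing point
    return (mean(passed) - mean(failed)) / sd * (p * q / y)


# Invented data: ten testees, with the item passed mostly by high scorers.
scores = [10, 12, 14, 15, 16, 18, 20, 22, 24, 25]
item = [False, False, False, True, False, True, True, True, True, True]
print(round(biserial(item, scores), 2))
```

A markedly positive value suggests that the item is consistent with the subtest as a whole; a value near zero or negative marks an item for revision or removal.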

Determining biserial correlations for all items is a long and laborious 
task, unless the work is done by calculating machines. It is not uncommon, 
therefore, to find that items have been correlated only with their subtest 
scores, followed by intercorrelations (product-moment) of the subtest
scores with one another and with the total-test scores. 

Discriminative Value: External Criteria. As emphasized
throughout our discussion, it is essential that validity be determined,
ultimately, by comparison of scores with one or more external criteria,
after or during the process of internal item analysis. Thus, each item may
be analyzed to determine whether it discriminates satisfactorily between a
low and a high group, as classified under one or more of the external
criteria detailed earlier in this chapter. As already stated, very few
items should be within the ability range of all, or nearly all testees.
Others should be of increasing selectivity. Some items should, of course,
discriminate between two extreme groups, for example, the highest and
lowest 10 percent of the population tested. But it is desirable to have
items whose selectivity extends beyond these narrow boundaries; items
that would also dependably distinguish between, for example, the highest
one-fourth and the second-highest one-fourth and between the lowest
one-fourth and the next-lowest one-fourth. Kelley has offered evidence
indicating that most marked and significant discrimination between extreme
groups is obtained when item analysis is based upon the highest 27 percent
and the lowest 27 percent of the group. This method, however, provides
only a crude item differentiation, since it does not provide a basis for
differentiating among 50 percent of the population, the large middle
group; see (15).

16 To begin with, before statistical analyses are made, the author of the
test had employed the principle of “content validity” or of “construct
validity.” The statistical analyses refine and improve the original group
of items that were devised according to either of these principles.

One procedure would be to find what percentage of
the highest 27 percent and what percentage of the lowest 27 percent
passed each item; then, by statistical calculation, to determine if the
difference between the two percentages is significant. The same method can
be followed with other proportions as well. In fact, items may be analyzed
with regard to a wide variety of group classifications. Each item might,
for example, be analyzed with reference to high, average, low average, and
low groups, classification being based upon total-test scores or upon
external validating criteria.
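The upper-lower comparison just described can be sketched as follows. The text does not name the "statistical calculation"; this sketch assumes a two-proportion z test with a pooled standard error, and the function name `upper_lower_discrimination` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def upper_lower_discrimination(total_scores, item_passed, fraction=0.27):
    """Compare the percentage passing an item in the extreme groups.

    Returns (p_high, p_low, z, two_sided_p). The groups are the highest
    and lowest `fraction` of testees ranked by total score.
    (Hypothetical helper; the z test is an assumed choice of method.)
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    p_low = sum(item_passed[i] for i in low) / k
    p_high = sum(item_passed[i] for i in high) / k
    pooled = (p_low + p_high) / 2  # groups are of equal size
    se = sqrt(pooled * (1 - pooled) * 2 / k)
    z = (p_high - p_low) / se if se > 0 else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_high, p_low, z, p_value


# Invented data: 100 testees; the item is passed by the upper half only.
totals = list(range(100))
passed = [t >= 50 for t in totals]
print(upper_lower_discrimination(totals, passed))
```

Changing `fraction` to 0.10 or 0.25 applies the same comparison to the other groupings mentioned in the text.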

Items should discriminate between some kinds of groupings but not 
others, depending upon the purpose of the test. For example, the items in 
a test of general intelligence should not favor either sex, whereas a
personality inventory intended to evaluate degree of “masculinity” and
“femininity” should discriminate sex differences. If a personality inventory
is to differentiate among two or more clinical classifications or behavior 
syndromes, then here, too, items that show significant group differences 
in responses will be chosen. On the other hand, in constructing tests of 
intelligence, psychologists have emphasized that they should be "culture- 
fair"; that is, they should not favor some socioeconomic groups and be 
unfair to others.17 

Speed Tests. Devices that are primarily tests of speed present a 
special problem. The position of each item in the series affects its apparent 
level of difficulty and its apparent validity, unless all individuals in the 
standardization group have had an opportunity to answer every item. In 
a speed test, all items should be of uniform, or nearly uniform, degree of 
difficulty. Having determined the level of difficulty desired, the best
practice, then, is to correlate subtest scores and total-test scores with
scores of criteria, under various time limits.


Validating Information 


The objective of all validating procedures is to make the most
useful selection of test types and test items from among those available, so
as to yield the highest prediction of the criterion or criteria. To do this,
the test’s author must have insight into the psychological processes
involved. In addition, he must write the test items clearly and precisely.
Then the ultimate decision on the criteria of validity in any area of
testing rests upon the analytical judgments and agreement of qualified
specialists, who evaluate the test's objectives and the groups for whom it
is intended.

17 Forms of item analysis, other than those here discussed, have been
proposed (21, 33).

Anyone who uses a test professionally—and none but professionally 
qualified persons should—will want to know certain essential facts about 
the instrument's validation. The following information should be given 
in the test's manual: 


The purpose of the test and the group or groups for whom intended. 

The type of validation that has been employed: content, construct,
concurrent, predictive, or a combination of these. Whichever one, or
combination, has been used, should be explained and justified on the basis
of either psychological or educational principles, or both.

The techniques used in item analysis.

The external validating criteria, if any; and the reasons for using them.
Correlation coefficients, expectancy tables, and standard errors of
measurement should be included whenever the data lend themselves to such
treatment. The source of the external validating criteria should be stated:
teachers’ grades; objective achievement tests; supervisors' ratings;
clinical diagnoses by psychologists or psychiatrists; other standardized
tests.

The standardization sample of the population. The given information,
where essential, should include age range, sex distribution, socioeconomic
distribution, range of ability or trait variation, educational level, type
of school.

Separate validity findings for different age groups, grade groups, ability
groups, behavior syndromes, clinical groups, culture and subculture groups,
and occupational groups. In fact, whenever membership in a particular
group might produce differences, or whenever a test is intended to make
group differentiations, separate validity findings should be provided.

Cross-validation data, if any; and the characteristics of the
cross-validating groups.

Influence of any special factors, such as speed of work, auditory
discrimination, color vision or visual acuity, manual dexterity, where
these are not the primary concern of the test, but can influence the scores
or ratings.


6.


INTERPRETATION OF TEST SCORES:
QUANTITATIVE AND QUALITATIVE

An Index of Relative Rank 


The raw score (that is, the actual number of units or points) 
obtained by an individual on a test does not in itself have much, if any, 
significance. One test may yield a maximum score of 150, another 200, 
and a third 300. Obviously, then, any point score on one of these tests 
is not directly comparable with the same number of points on either of 
the others; a score of 43 on one test cannot be directly compared with a 
score of 43 on another. Furthermore, the average scores of each of these 
will in all probability be different, as will the degree of variation of 
scores (called the deviation) both above and below the average. For ex- 
ample, the average (mean) score of the first test for a given age is, let us 
say, 90, with approximately the middle two thirds of the scores falling
between 75 and 105. For the second test the mean is 120, with the middle 
two thirds of the scores between 100 and 140; while for the third test the 
mean is 180, with the range of the middle two thirds between 150 and 210. 

It is clear that if scores obtained on each of several tests are to be
compared, indexes must be used which will express the relative significance
of any given score; or what is known as relative rank. In the example given
above, assuming that all three tests are intended for the same group, the
mean scores of 90, 120, and 180 have the same relative significance:
persons getting these scores would be at the average in each. Similarly,
75, 100, and 150 have the same relative significance in their respective
tests; for persons getting these scores would be one standard deviation
below the means (averages), which signifies that their scores surpass only
about sixteen percent of all the scores made by the population sampling
upon which the tests were standardized.
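The comparison of the three tests can be checked with a short calculation. This is a sketch under two assumptions not stated outright in the text: that the scores are normally distributed, and that the "middle two thirds" of each distribution spans one standard deviation on either side of the mean (giving standard deviations of 15, 20, and 30). The names `TESTS` and `relative_rank` are hypothetical.

```python
from statistics import NormalDist

# (mean, standard deviation) for the three tests in the example;
# the SDs are read off from the "middle two thirds" ranges (an assumption).
TESTS = {"first": (90, 15), "second": (120, 20), "third": (180, 30)}
SCORES = {"first": 75, "second": 100, "third": 150}


def relative_rank(score, mean, sd):
    """Return (distance from the mean in SD units, percent of scores surpassed)."""
    z = (score - mean) / sd
    return z, 100 * NormalDist().cdf(z)


for name, (mean, sd) in TESTS.items():
    z, pct = relative_rank(SCORES[name], mean, sd)
    print(f"{name} test: score {SCORES[name]} is {z:+.1f} SD, "
          f"surpassing about {pct:.0f} percent")
```

All three raw scores come out one standard deviation below their means, which is why they carry the same relative significance despite being different numbers.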

Innumerable other comparable points and scores could be selected for 
illustration. Obviously, however, such score-for-score comparisons would 
be extremely cumbersome and would, in each instance, have to be in- 
terpreted in terms of some common, meaningful index. Hence, to facili- 
tate interpretation, sound psychological tests will provide tables of age 
norms, or grade norms, or percentile ranks, or decile ranks, or standard 
scores, depending upon the instrument's purpose. Other kinds of norms, 
suitable to the test, should also be provided. 


Norms


A norm is the average or typical score (mean or median) on a
particular test made by a specified population: for example, the mean
intelligence test score for a group of 10-year-old children; the mean score
for a group of fifth-grade pupils on a test of arithmetic fundamentals.

Reference to a test's table of norms enables us to rank an individual
pupil's performance relative to his own and other age or grade groups.
For example, a child of 10 might have an intelligence test score that is
average for his own age, or for a population of 9-year-olds, or for those
who are eleven years of age. On a test of arithmetic fundamentals, a
fourth-grade pupil’s score might be typical for his level, or for that of the
grade above or below.

Since it is desirable to locate an individual’s score and relative rank not
only with reference to an average, but also with reference to other levels
in the scale, tables of norms should include the frequency distribution of
the scores, from which percentile ranks and standard scores may be readily
calculated, if they are not already provided in the test’s manual.1 (These
indexes are explained in later sections of this chapter.) Table 6.1 is an
illustration of this practice. From this, on the basis of his score, an
individual's relative position may be readily determined with respect to
his own and other age groups. Table 6.1, as a type, is of additional
interest and value because it shows the range of scores within each age
group, and the extent of overlapping of scores from age to age, and between
any two age groups one might want to compare.

1 For this reason, a table of norms presents not only the single value, the
average, but also a range of values representing the over-all performance
of the group.

Table 6.2 represents another type of presentation. Here norms are

TABLE 6.1


THE FREQUENCY DISTRIBUTION OF SCORES FOR EACH AGE GROUP
CHICAGO NONVERBAL EXAMINATION (VERBAL DIRECTIONS)


Score Age
6-0 7-0 8-0 9-0 10-0 11-0 12-0 13-0 14-0*
to to to to to to to to and
6-11 7-11 8-11 9-11 10-11 11-11 12-11 13-11 above
180-184 1
175-179 1 9
170-174 0 9
165-169 1 5 21
160-164 2 1 6 49
155-159 1 6 11 70
150-154 1 0 5 18 86
145-149 0 5 13 16 129
140-144 3 5 19 35 114
135-139 1 12 20 29 130
130-134 2 8 16 31 35 175
125-129 1 2 14 18 41 37 190
120-124 0 7 18 35 58 52 172
115-119 1 9 18 36 41 45 145
110-114 3 19 33 36 24 30 122
105-109 6 15 43 30 37 29 122
100-104 11 29 35 41 28 21 95
95-99 2 27 27 41 22 17 14 62
90-94 6 22 45 42 26 10 11 44
85-89 12 40 32 29 13 10 11 19
80-84 2 11 37 37 19 7 4 3 24
75-79 3 28 43 33 20 7 6 4 12
70-74 7 32 29 19 8 3 2 2 14
65-69 16 47 33 21 5 3 1 4 11
60-64 12 56 27 8 4 2 1 1 3
55-59 29 43 20 4 4 1 0 4 6
50-54 43 45 12 4 1 0 0 4
45-49 31 32 7 0 1 1 2 2
40-44 35 23 9 3 1 1 2
35-39 26 16 6 1 1 1 0
30-34 26 16 3 0 0 0
25-29 22 10 3 1 0 1
20-24 18 6 ii 1 1
15-19 21 6 2 0
10-14 14 6 1 1
5-9 6 4
0-4 7 2
N (six oldest age groups) 318 352 324 379 423 1844
Mean (six oldest age groups) 89.0 99.3 110.0 118.5 122.8 125.6
SD 17.6 18.1 18.7 17.2 18.8 18.6 18.6 20.8 22.2


TABLE 6.2


AGE NORMS OF OTIS INTERMEDIATE EXAMINATION
(30-Minute Time Limit)
Months                        Years
          8    9   10   11   12   13   14   15   16   17   18 or over
  0       7   15   23   31   38   44   49   53   56   58   59
  1       8   16   24   32   39   44   49   53   56   58
  2       8   16   24   32   39   45   50   53   56   58
  3       9   17   25   33   40   45   50   54   57   58
  4      10   18   26   34   40   46   50   54   57   58
  5      10   18   26   34   41   46   51   54   57   58
  6      11   19   27   35   41   46   51   55   57   59
  7      12   20   28   35   42   47   51   55   57   59
  8      12   20   28   36   42   47   52   55   58   59
  9      13   21   29   36   43   48   52   55   58   59
 10      14   22   30   37   43   48   52   56   58   59
 11      14   22   30   37   43   49   53   56   58   59


Source: A. S. Otis (14). By permission. 


given, at monthly intervals, for the Otis Self-Administering Test of
Mental Ability.

Table 6.3 illustrates the importance of providing separate norms and
distributions of scores for each of several groups within a broad category.
In this instance, though all had completed a college preparatory course
and were college freshmen, there are differences among the several groups,
some of them being quite significant. Inspection of the table shows that
an identical score in each of the groups does not signify identical relative
ranks. For example, a score of 119 gives a percentile rank of 37.5 in the
B.A. group, 57.5 in the Business group, and 42.5 in the Engineering
group (see the note under Table 6.3). This kind of table is essential for
purposes of guidance and selection.

The characteristics of any table of norms will depend on a number of
factors affecting the individuals who make up the group. For example, in
standardizing a psychological test, the norm and the distribution of scores
will be influenced by the representativeness of the population sample;
that is, by the proportion from each sex, their geographic distribution,


TABLE 6.3


NORMS FOR COLLEGE FRESHMEN (MALE): COLLEGE QUALIFICATIONS TEST
ARRANGED ACCORDING TO TYPE OF CURRICULUM
(Raw Scores)

Percentile*   B.A.      B.S.      Business   Nursing    Engineering
99            186-200   185-200   174-200    174-200    184-200
90            164-170   161-168   146-152    148-154    161-166
80            153-157   151-155   136-140    136-139    149-153
70            145-147   142-145   127-131    128-131    140-143
60            136-139   133-136   119-122    120-122    132-135
50            127-131   126-129   111-114    115-117    124-127
40            119-122   117-121   102-105    106-110    115-119
30            108-112   107-111   93-96      100-101    107-110
20            96-101    95-100    83-87      94-95      97-101
10            78-87     79-87     73-76      78-87      84-90
N             2359      3871      939        254        2576
Mean          126.6     125.1     111.9      115.7      125.1
SD            31.1      30.3      28.1       25.7       28.5


* Each percentile value in this table represents a range of percentile
points of which the indicated percentile value is the mid-point. Thus the
percentile value of 50 represents 47.5-52.5.

Source: G. K. Bennett et al. (1). By permission.


their socioeconomic status, and their age distribution. In devising a test of 
educational achievement, factors influencing the normative data, in addi- 
tion to the foregoing, are the quality of the schools and the kinds of 
curricula from which the standardization population is drawn. Norms 
of tests of aptitude (for example, clerical or mechanical) are influenced by 
the standardization population’s degree of experience, the kind of work 
they have been doing, and by the representativeness of the group. 

The point to be emphasized, therefore, is that tables of norms derived 
for each of several tests classified under the same name and intended for 
the same purposes are not necessarily comparable. Before deciding on the 
selection and use of a test, it is always necessary to know the
characteristics of its standardization population. This information is
essential in determining whether the instrument is appropriate in a given
situation.


Percentile and Decile Ranks 


Percentile Rank. An individual's percentile rank on a test
designates the percentage of cases or scores lying below it. Thus, a person
having a percentile rank of 20 (P20) is situated above twenty percent of
the group of which he is a member; or, otherwise stated, twenty percent
of the group fall below this person's rank. A percentile rank of 70 (P70)
means that seventy percent fall below, and so on, for any percentile rank
in the scale. In effect, this statistical device makes it possible to
determine at which one-hundredth part of the distribution of scores or
cases any particular individual is located. By this means a person's
relative status, or position in the group, can be established with respect
to the traits or functions being tested. And, as will be seen,
psychological measurement, unlike physical measurement, derives its
significance principally from relative ranks ascribed to individuals rather
than from quantitative units of measurement.

A table of norms and frequency distribution often provides percentile 
ranks (see Table 6.3). Or, if the percentile ranks themselves are not given 
in a table, it is possible to calculate them easily from the frequency dis- 
tribution. 

The percentile method is a technique whereby scores on two or more
tests, given in units that are different, may be transformed into uniform
and comparable values. This method has the advantage of not depending
upon any assumptions regarding the characteristics of the distribution
with which it is used. The distribution might be normal, skewed, or
rectangular. When a percentile rank is given for a particular individual,
it refers to his rank in the specified group of scores from which it has
been derived. On a test of reading comprehension at the fourth-grade level,
for example, a percentile rank of 60 for a particular pupil is relevant to
the group of pupils for whom the distribution of scores was found. Whether
or not this same pupil would be rated at percentile 60 as a member of
another fourth-grade population will depend on the comparability of the
two groups. His rating might be the same, or higher, or lower. (In this
connection, consult Table 6.3, giving norms for college freshmen.)


Fig. 6.1. Unequal distances between points on the base line of a normal
curve by successive 10-percent divisions (deciles) of its area


Fig. 6.2. Equal distances between points on the base line of a rectangle
by successive 10-percent divisions (deciles) of its area


Percentile points are based upon the number of scores (cases) falling
within a certain range; hence the distance between any two percentiles
represents a certain area under the curve; that is, a certain number of
cases (N/100). Reference to Figure 6.1 shows that if percentages of the
total area (total number of cases) are equal, the distances on the base line
(range of scores) must be unequal, unless the distribution is rectangular
(Fig. 6.2).2 It is obvious from Figure 6.1 that differences in scores
between any two percentile points become greater as we move from the median
(P50) toward the extremes. Inspection of the curve shows, for instance, that
the distance on the base line (representing scores) between percentiles 50
and 60 and the distance between 50 and 40 (these being at the center and
equal) are smaller than that between any other intervals of ten percentile
points. What this means in the practical interpretation of test results is
that at, and close to, the median, differences in scores between percentile
ratings are smaller in the measured characteristic than they are between
the same percentile differences elsewhere on the curve. See, for example,
the spread of the base line between 50-60, and that between 80-90, or
90-100. Yet each of these represents ten percentile points.
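The unequal base-line distances can be computed directly. The sketch below assumes a unit normal curve and expresses each distance in standard deviation units; the name `decile_gaps` is hypothetical.

```python
from statistics import NormalDist


def decile_gaps():
    """Base-line distance (in SD units) between successive deciles, P10 to P90.

    (Hypothetical helper, assuming a unit normal distribution.)
    """
    unit = NormalDist()
    points = [unit.inv_cdf(p / 100) for p in range(10, 100, 10)]
    return [upper - lower for lower, upper in zip(points, points[1:])]


gaps = decile_gaps()
labels = [f"P{p}-P{p + 10}" for p in range(10, 90, 10)]
for label, gap in zip(labels, gaps):
    print(f"{label}: {gap:.2f} SD")
```

The intervals on either side of the median are the narrowest, and the intervals widen toward the extremes, although each contains the same ten percent of the cases.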

The percentile technique has the advantage of being easily calculated,
easily understood, and of making no assumptions with regard to the
characteristics of the total distribution. It answers the question: “Where
does an individual’s score rank him in his group?” Or: “Where does an
individual's score rank him in another group whose members have taken
the same test?”


2 Occurrence of a rectangular distribution is extremely improbable.

Figure 6.3 is an example of the cumulative frequency curve, also known
as the percentile curve. The cumulative frequencies are plotted, rather
than the separate frequencies for each class interval. The points that are
joined on the curve represent the upper limit of each interval. From such
a curve it is readily possible to read off for any given value the percentile
equivalent of its frequency distribution. Conversely, it is possible to read
off the score equivalent for any given percentile.


FREQUENCY DISTRIBUTION
(cumulative)

Score      f     cf
98-102     1     65
93-97      6     64
88-92      5     58
83-87     12     53
78-82     10     41
73-77     15     31
68-72      8     16
63-67      6      8
58-62      2      2


Fig. 6.3. Cumulative frequency curve. From E. F. Lindquist, A First
Course in Statistics. By permission.
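The points of such a curve can be computed from the frequency column alone. The following sketch uses the frequencies shown with Figure 6.3; treating the upper limit of an interval as half a unit above its highest whole score is an assumption, and the names `INTERVALS` and `cumulative_percentiles` are hypothetical.

```python
# (lowest score, highest score, frequency), lowest interval first,
# taken from the frequency distribution accompanying Figure 6.3.
INTERVALS = [
    (58, 62, 2), (63, 67, 6), (68, 72, 8), (73, 77, 15), (78, 82, 10),
    (83, 87, 12), (88, 92, 5), (93, 97, 6), (98, 102, 1),
]


def cumulative_percentiles(intervals):
    """Return (upper limit, cumulative frequency, cumulative percent) rows.

    (Hypothetical helper; the upper limit is taken as the highest score
    in the interval plus 0.5, which is an assumed convention.)
    """
    total = sum(f for _, _, f in intervals)
    running = 0
    rows = []
    for _, highest, f in intervals:
        running += f
        rows.append((highest + 0.5, running, 100 * running / total))
    return rows


for limit, cf, pct in cumulative_percentiles(INTERVALS):
    print(f"below {limit}: cf = {cf:2d}, percentile = {pct:5.1f}")
```

Plotting the cumulative percent against the upper limits and joining the points gives the percentile curve; reading it in either direction converts scores to percentiles or percentiles to scores.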


Decile Rank. The decile rank is the same in principle as the
percentile; but instead of designating the one-hundredth part of a
distribution, it designates the one-tenth part of the group (N/10) in which
any tested person is placed by his score. The term “decile” is used to
mean a dividing point. “Decile rank” signifies a range of scores between
two dividing points. Thus a testee who has a decile rank of 10 (D10)
is located in the highest 10 percent of the group; one whose decile rank is
9 (D9) is in the second highest 10 percent; one whose decile rank is 1 (D1)
is in the lowest 10 percent of the group.
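A decile rank can be assigned by counting the scores that fall below a given score. This is a minimal sketch; the function name `decile_rank` and the handling of tied scores (ties are not counted as falling below) are assumptions.

```python
def decile_rank(score, group_scores):
    """Return the decile rank (1 = lowest tenth, 10 = highest tenth).

    (Hypothetical helper; only scores strictly below `score` are counted.)
    """
    below = sum(1 for s in group_scores if s < score)
    percent_below = 100 * below / len(group_scores)
    # 0-9.99 percent below -> rank 1, ..., 90 percent or more below -> rank 10
    return min(10, int(percent_below // 10) + 1)


group = list(range(1, 101))  # invented group of 100 distinct scores
print(decile_rank(95, group), decile_rank(85, group), decile_rank(5, group))
```

Replacing the divisor 10 with 25 or 20 would give quartile or quintile ranks in the same way.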

When the number of scores in a distribution is small, percentiles are not
used, because there is little or no significance in making fine distinctions
in rank. The decile-ranking method may be used instead.3

3 When a rather coarse classification will serve the purpose, the quartile
rank (N/4) or the quintile rank (N/5) may be used. As these terms indicate,
they show, respectively, the one-fourth, or the one-fifth part of the
distribution of scores in which an individual's score places him.


The Standard Score and Variants


Standard Score. The meaning of this index (z) is less obvious
than that of percentile and decile ranks, although it, too, designates
an individual's position with respect to the total range and distribution
of scores. The standard score indicates, in terms of standard deviation,
how far a particular score is removed from the mean of the distribution.
The mean is taken as the zero point, and standard scores are given as plus
or minus. If the distributions of scores of two or more tests are
approximately normal, standard scores derived from one distribution may be
compared with those derived from the others.

The formula is:

$$z = \frac{X - M}{SD}$$

in which X is an individual score, M is the mean of the distribution, and
SD its standard deviation.

Assume, for example, that the mean IQ of a group is 100 and that the
standard deviation is 14. In this distribution an individual reaching an
IQ of 114 has a z-score of +1.0. Another individual having an IQ of 79
has a z-score of -1.5.

Standard scores must ultimately be given percentile values to express
their full significance. Since the number of cases encompassed within a
given number of standard deviations in a normal distribution is
mathematically fixed, it is always possible to translate a z-score into a
percentile value. Thus, a person having a z-score of +1.0 has a percentile
rank of approximately 84; that is, his score surpasses 84 percent of the
scores in the group. The person having a z-score of -1.5 has a percentile
rank of approximately 7, surpassing only 7 percent of the scores. Table 6.4
gives several standard scores and their percentile values, for illustrative
purposes. Figure 6.4 shows the percentage of scores (or cases) above and
below standard scores, with their corresponding percentile values.


TABLE 6.4


PROPORTIONS OF CASES OR AREA UNDER THE CURVE,
CORRESPONDING TO GIVEN STANDARD SCORES


z-score    Approximate percent of cases from the mean
0.50       19
1.00       34
1.25       39
1.75       46
2.00       48
3.00
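The translation of a z-score into a percentile value can be sketched as follows, using the IQ example given earlier (mean 100, standard deviation 14). The sketch assumes a normal distribution; the three function names are hypothetical.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Distance of score x from the mean, in standard deviation units."""
    return (x - mean) / sd


def percentile_from_z(z):
    """Percent of cases falling below a given z-score (normal curve assumed)."""
    return 100 * NormalDist().cdf(z)


def percent_between_mean_and(z):
    """Percent of cases lying between the mean and a given z-score."""
    return 100 * (NormalDist().cdf(abs(z)) - 0.5)


for iq in (114, 79):
    z = z_score(iq, mean=100, sd=14)
    print(f"IQ {iq}: z = {z:+.1f}, percentile rank about {percentile_from_z(z):.0f}")

for z in (0.50, 1.00, 1.25, 1.75, 2.00):
    print(f"z = {z:.2f}: about {percent_between_mean_and(z):.0f} percent from the mean")
```

The second loop reproduces, to the nearest whole percent, the entries that are legible in Table 6.4.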

As an index of relative rank, the standard score is preferred by
psychologists because it is a well-defined property of the normal curve,
representing a fixed and uniform number of units. Percentile and decile
scores, on the other hand, are ranks in a group and do not represent equal
units of individual differences.


Fig. 6.4. The normal curve and derived scores: percent of cases under
portions of the normal curve, standard deviations, cumulative percentages,
percentile equivalents, typical standard scores (z-scores, T-scores,
stanines and percent in each stanine), and deviation IQs. From H. G.
Seashore et al. (15). Reproduced by permission.


The standard score principle has been utilized in devising the index
known as the deviation intelligence quotient. This index is explained in
a later section of this chapter; and it will be discussed again in
connection with two of the individual intelligence tests (Stanford-Binet
and the Wechsler scales).

A variant of the standard score, known as a T-score, was suggested by
McCall (13). Using the T-score method, the mean is set at 50, whereas in
terms of standard score, the value of the mean is zero. To obtain a
T-score, the standard score is multiplied by 10 and then added to, or
subtracted from, the mean T-score of 50. Thus, a standard score of
+1.00 becomes a T-score of 60, while one of -1.00 becomes 40. In using
this technique, the justified assumption is that nearly all scores will be
within a range of five standard deviations from the mean. Since each SD is
divided into ten units, the T-score is based upon a scale of 100 units, thus
avoiding negative scores and, in most instances, fractions.

It should be clear that this index, found for any individual, is relevant 
only to the distribution of scores of the group from which the values 
have been derived and with which his score is being compared. 
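The T-score conversion amounts to one line of arithmetic. The sketch below follows the rule as stated in the text (multiply the standard score by 10 and add it to 50); the function names are hypothetical, and the IQ figures reuse the earlier example with a mean of 100 and a standard deviation of 14.

```python
def t_score(z):
    """Convert a standard score (z) to a T-score with mean 50 and SD 10."""
    return 50 + 10 * z


def t_score_from_raw(x, mean, sd):
    """Convert a raw score to a T-score, given its group's mean and SD."""
    return t_score((x - mean) / sd)


print(t_score(1.0), t_score(-1.0))
print(t_score_from_raw(114, 100, 14), t_score_from_raw(79, 100, 14))
```

Because the mean and standard deviation are those of a particular group, the resulting T-score is meaningful only in relation to that group.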

Stanine. Another variant of the standard score technique is
called the stanine, a term coined by psychologists in the Army Air Force
during World War II. According to this method, the standard population
is divided into nine groups; hence, “standard nine” contracted to
“stanine.” Excepting the ranks of stanine 1 (lowest) and 9 (highest), each
unit is equal to one half of a standard deviation. A score of 5 represents
the median group, defined as those whose scores are within ±0.25 SD of
the mean; that is, a range of a half-sigma at the center of the distribution.
A rank of stanine 6 represents the group whose scores fall between +0.25
sigma and +0.75 sigma (SD). The meanings of the other stanine rankings
can be similarly determined in terms of standard deviations, except 1 and
9, as already stated, since the former represents all scores below -1.75
sigma, and the latter includes those above +1.75 sigma (see Fig. 6.4).
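The stanine boundaries described above can be written out and applied mechanically. This sketch assumes a normal distribution when computing the percent of cases in each stanine; the names `STANINE_BOUNDS`, `stanine`, and `percent_in_each_stanine` are hypothetical, and assigning a score that falls exactly on a boundary to the higher stanine is an arbitrary choice.

```python
from bisect import bisect_right
from statistics import NormalDist

# Dividing points between stanines 1-2, 2-3, ..., 8-9, in SD units.
STANINE_BOUNDS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def stanine(z):
    """Return the stanine (1 to 9) for a standard score z."""
    return bisect_right(STANINE_BOUNDS, z) + 1


def percent_in_each_stanine():
    """Percent of a normal distribution falling in each stanine."""
    unit = NormalDist()
    edges = [0.0] + [unit.cdf(b) for b in STANINE_BOUNDS] + [1.0]
    return [100 * (hi - lo) for lo, hi in zip(edges, edges[1:])]


print([stanine(z) for z in (-2.0, 0.0, 0.5, 2.0)])
print([round(p) for p in percent_in_each_stanine()])
```

The rounded percentages show how few cases fall in the two open-ended extreme stanines compared with the middle group.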

This single-digit system of scores has certain advantages for machine 
computation; and it does eliminate plus and minus signs. Other than these 
considerations, it is difficult to find any advantage in its use in pref- 
erence to the others already described. 


In addition to understanding the meanings of the several indexes of 
variation described, it is essential to realize that basically all of them 
derive their significance from their relations to the percentile system. In 
other words, the psychologist and other users of test results will always 
want to have the answer to the question: “What is the percentile equiva- 
lent of a given standard score, or T-score, or stanine score?” Figure 6.4 
presents the normal frequency curve with each of the several derived 
scores and their percentile equivalents, as well as the relative positions of 
several deviation IQs, which are explained later in this chapter. 


Mental Age and Intelligence Quotient 


Mental Age. This concept was introduced by Alfred Binet in
1908 in conjunction with the first revision of his scale. In this scale as
in its later revisions, items are grouped according to age levels. For
example, selected items, passed by a specified percentage of five-year-old
children in the standardization sample are placed at the five-year level;
items passed by a specified percentage of six-year-old children are placed
at the six-year level.4

To determine mental age, in the 1908 scale, Binet adopted the following 
rule: the child was credited with the mental age of the highest year 
level in which he passed all test items, or all but one. To this basic level 
an additional year in mental age was added for every five items he 
passed in higher levels; but no fractional years were added for fewer than 
five items passed. The defect of this method was recognized, so that in his 
1911 scale, Binet modified his procedure in order to permit the addition 
of a fraction of a year for items passed. In the Stanford-Binet revisions, 
the testee is credited with all items up through the age level at which he 
passes all. This is called the basal year. He is also credited with all items 
passed above the basal year. The sum of his basal plus the other credits, in 
terms of months, is his mental age.⁵ 
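
The two scoring rules can be contrasted in a short sketch. This is an illustration under stated assumptions, not the published scoring procedure. The function names are hypothetical, and the credit of two months per item used in the example is an assumed value chosen only to show the arithmetic.

```python
def mental_age_1908(basal_year: int, items_passed_above: int) -> int:
    """Binet's 1908 rule as described: one whole year for every five items
    passed above the basic level, and nothing for fewer than five."""
    return basal_year + items_passed_above // 5


def mental_age_months(basal_year: int, credits_in_months: list[int]) -> int:
    """Basal-year rule: basal year expressed in months plus the month
    credits earned for items passed above it."""
    return basal_year * 12 + sum(credits_in_months)


def as_years_and_months(months: int) -> str:
    years, remainder = divmod(months, 12)
    return f"{years} years, {remainder} months"


if __name__ == "__main__":
    # Four items passed above a basal level of 8.
    print(mental_age_1908(8, 4))  # the four items earn no credit
    # Same child, assuming a hypothetical 2 months of credit per item.
    print(as_years_and_months(mental_age_months(8, [2, 2, 2, 2])))
```

The first call shows the defect noted in the text: four items passed above the basic level add nothing under the 1908 rule, whereas a rule that credits fractions of a year records them.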

Mental age norms are also used with scales that are not arranged ac- 
cording to age levels. These are point scales that yield a score usually 
based on the number of items correctly answered. By means of a table of 
norms provided for the particular test being used, it is possible to assign 
an individual an age rating. Thus, on a point scale, an individual who, 
regardless of chronological age (CA), earns a score equal to the norm of 
the ten-year-old population sample, will have a mental age (MA) of ten, 
as determined by that test. 

In determining mental age, whether by using an age scale or a point 
scale, an individual's performance on a standardized series of test items 
is being compared with the performance of the average group of a rep- 
resentative sample at successive age levels. Hence, we define mental age 
as the level of development in mental ability expressed as equivalent to 
the chronological age at which the average group of individuals reaches 
that level. For example, a child having an MA of eight has reached the 
level of the average group of eight-year-olds in the standardization group.⁶ 

At this point, our concern is only to define and clarify the mental age 
concept. There are several important psychological and measurement 
problems connected with this concept that are explained at several 
appropriate points in later sections. 


4 The question of placement of items in an age scale, according to percentage 
passing, is discussed in connection with the Binet scale and its revisions. 
5 In giving the test, the examiner continues upward in the scale until an age 
level is reached at which the individual fails all items. This is called the 
terminal year. 
6 The actual methods of determining mental age used by Binet and in revisions 
that followed are explained in Chapters 8, 9, and 10, in which the scales 
themselves are discussed. 


Intelligence Quotient. The use of this index was first suggested 
by Stern (17)⁷ and Kuhlmann (12) in 1912; but it was not actually 
employed as part of test findings and reports until 1916 when the first edi- 
tion of the Stanford-Binet scale was made available. The intelligence 
quotient, the ratio of an individual's mental age to his chronological age, 
is found by the formula: 


IQ = (MA/CA) × 100 


The ratio is multiplied by 100 to remove the decimal. 

An individual's IQ indicates rate of mental development or degree of 
brightness. If mental development keeps pace with one's life age,⁸ the 
quotient is 100. If mental development lags, or is accelerated, the quotient 
will be less than or greater than 100, depending upon the degree of re- 
tardation or acceleration. 
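
The ratio can be checked with a few lines of Python. This is a minimal sketch; the function name is hypothetical, and expressing both ages in months is an assumption made so that fractional years need no special handling.

```python
def ratio_iq(mental_age_months: float, chronological_age_months: float) -> int:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return round(100 * mental_age_months / chronological_age_months)


if __name__ == "__main__":
    print(ratio_iq(96, 96))    # development keeps pace with life age
    print(ratio_iq(72, 96))    # development lags
    print(ratio_iq(120, 96))   # development is accelerated
```

A child of eight years (96 months) with a mental age of eight obtains 100; mental ages of six and ten years at the same life age give quotients below and above 100, respectively.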

It is clear that mental age alone does not adequately represent an in- 
dividual's mental capacity; for persons of different, at times widely dif- 
ferent, chronological ages may and do reach the same mental age at a 
given time. One of the values of the IQ, therefore, is to reflect these age 
differences; hence it is defined in terms of rate of mental development 
and, as an attribute, degree of brightness. 

In his volume accompanying the 1916 Stanford-Binet scale, Terman 
included a table showing the percentage of children at each of a number 
of IQ levels, each of which he gave a name; for instance, dull, normal, 
very superior. An individual whose test performance is normal for his 
chronological age earns an IQ rating of 100. 


IQ range    Terman's category 

80-89       dullness 
90-109      average intelligence 
110-119     superior intelligence 
120-140     very superior intelligence 


As in any such table, the limits of each category were arbitrarily de- 
termined. These and similar categories are intended to serve only as 
guides in the interpretation of intelligence quotients and for purposes of 
statistical classification and analysis. 

There are a number of problems associated with the interpretation and 
use of the IQ that are explained in later sections of this and other 
chapters. Our purpose at this stage is primarily to define and explain the 
meaning of the concept. 


7 In this publication, dated 1916, Stern refers to his suggestion of 1912. 
8 That is, until the age when maximum mental capacity is reached. 


Deviation IQ. In later chapters dealing with the Stanford-Binet 
scale, and other scales, it will be seen that the standard deviations 
of intelligence quotients obtained by the relation MA/CA are not always 
of the same, or nearly the same, size at all age levels. For example, at one 
age level, SD might be 12; at another, 16; at still another, 18. The reasons 
for these differences are given in later chapters; but now it should be 
noted that differences such as these create problems and irregularities in 
the interpretation of the relative meaning of a given IQ. Thus, for 
example, in the first instance, an IQ of 88 (-1 SD) signifies a percentile 
rank of 16; in the second instance, an IQ of 84 has the same percentile 
equivalent; while in the third, an IQ of 82 also signifies a percentile rank 
of 16. 
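
The irregularity can be made concrete with a short calculation. The sketch below assumes normally distributed quotients with a mean of 100 at every age level; the function name is hypothetical.

```python
import math


def percentile_rank(iq: float, sd: float, mean: float = 100.0) -> float:
    """Percentile rank of an IQ, assuming a normal distribution."""
    z = (iq - mean) / sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    # The three IQs named in the text all lie one SD below the mean.
    for iq, sd in ((88, 12), (84, 16), (82, 18)):
        print(iq, sd, round(percentile_rank(iq, sd)))
    # The same IQ of 84 ranks differently when the SD changes.
    for sd in (12, 16, 18):
        print(84, sd, round(percentile_rank(84, sd), 1))
```

The second loop shows the converse of the point made in the text: a single ratio IQ does not carry a fixed percentile meaning when the standard deviation varies from one age level to another.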

In order to overcome this difficulty, the "deviation IQ" is used with 
some tests. This index is an adaptation of the standard score (z) 
technique. The method of determining the deviation IQ can be shown by 
using the Wechsler test's procedure as an illustration.⁹ For each 
individual, the raw score is converted into a weighted score by using a 
conversion table. The mean weighted score of the group is given a deviation 
IQ value of 100; the standard deviation of the scores is equated with a 
deviation IQ value of 15. Thus, a person whose point score places him at 
-1 SD will have a deviation IQ of 85. One whose score is at -2 SD will 
have a deviation IQ of 70. Positive SD values will give ranks above 100: 
+1 SD equals 115 IQ; +2 SD equals 130; and so forth. 
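
The conversion amounts to rescaling a standard score. The following is a minimal sketch of that arithmetic, not the published Wechsler procedure, which works from conversion tables. The function name and the group mean of 50 and SD of 10 used in the example are hypothetical.

```python
def deviation_iq(weighted_score: float, group_mean: float, group_sd: float,
                 iq_mean: float = 100.0, iq_sd: float = 15.0) -> float:
    """Express a weighted score as a deviation IQ.

    The group mean is set equal to 100 and one group SD to 15 IQ points.
    """
    if group_sd <= 0:
        raise ValueError("group_sd must be positive")
    z = (weighted_score - group_mean) / group_sd  # standard score
    return iq_mean + iq_sd * z


if __name__ == "__main__":
    # Hypothetical age group: mean weighted score 50, SD 10.
    for score in (30, 40, 50, 60, 70):
        print(score, deviation_iq(score, group_mean=50, group_sd=10))
```

Because the same mean and SD are imposed on every age group, a given deviation IQ keeps the same percentile meaning throughout the age range.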


The 1960 revision of the Stanford-Binet scale also uses the deviation 
IQ, calculated by a different method; but the basic principle is the same. 
The principle is that an individual's intelligence quotient should be 
determined by the relative extent to which his score on the test deviates 
from the mean of his age group, and that an intelligence quotient of a 
given value should have the same relative significance throughout the 
age range. These ends are now achieved by using units of standard 
deviation as the basis; hence the name of the new index. 

Making the mean score equal to a deviation IQ of 100 is readily 
understandable, since this value has long been conventional and is accepted as 
representing the average or normal. It also appears that the most probable 
standard deviation of intelligence quotients is 15-16, as found with the 
Stanford-Binet (which in many ways is regarded as a criterion); hence 15 
has been taken to represent the standard deviation of deviation IQs. 
The choice of this value, therefore, was not an arbitrary one. Furthermore, 
the distribution yielded by a standard deviation of 15 is very 
similar to the one to which psychologists and educators have become 
accustomed, and in which values at each of the several levels have a 
qualitative significance in regard to mental ability and educational 
promise. 


9 The methods of calculating deviation IQs of the Stanford-Binet and the 
Wechsler scales are given more fully in Chapters 10 and 11. 

The deviation IQ, furthermore, is especially useful at age levels above 
16 or 18 years. For these and older persons, the use of mental age and the 
formula for the ratio IQ (MA/CA) have been regarded as inappropriate 
and questionable by many psychologists. (This aspect of mental age is dis- 
cussed in Chapter 10.) 

It should be clear, from the materials thus far presented, that per- 
centiles, standard deviations, standard scores, and intelligence quotients 
are intimately related. Whatever index is used, its principal significance 
is found in the relative rank it represents and in its psychological, 
educational, and vocational connotations. 

Although the primary purpose of this section is to define and explain 
these concepts used in psychological testing, it is relevant here to 
emphasize the qualitative aspects of these indexes. 


Qualitative Aspects 


Assume that three boys, all of the same age, have been tested. 
Suppose that their intelligence quotients are 50, 100, and 150. Since these 
are numerical ratios (MA/CA × 100), it is natural to assume that they 
have a quantitative significance. So they do, for they indicate rate of 
mental development. But these quotients also have a qualitative 
significance, for, among other things, they indicate each boy's position in 
the "hierarchy of intelligence." If the measure of intelligence is valid, the 
boy having the IQ of 50 is seriously retarded and is in the lowest one 
percent of the population in respect to the psychological functions being 
tested; the boy with the IQ of 100 is the typical or average individual, 
midway up (or down) in the distribution of intelligence; and the boy 
having the IQ of 150 is very superior and belongs in the top percentile 
rank of the group. 

Qualitative significance of the intelligence quotient can be illustrated 
further by asking this question: Is the brightest of these three boys one 
and one-half times as intelligent as the average boy, and three times as 
intelligent as the dullest? This question cannot be answered in terms of 
how many times more capable or less capable one is than the others, 
because the IQ is not a percent. The qualified school or clinical 
psychologist [...] inferences from each boy's [...], extent and level of 
educability, vocational possibilities and levels, and probable types of 
interests. 

The boy with an IQ of 50 probably will not be able to complete more 
than the second grade; the boy having the IQ of 100 should be able to 
complete twelve grades; the boy with an IQ of 150 will be able to progress 
in education as far as his interests and motives indicate. Obviously, too, 
the kinds of occupations that will be open to the first boy are very limited; 
those open to the second will be numerous; those open to the third will be 
practically unrestricted, so far as mental capacity is concerned. And the 
same may be said of the range of interests in general that will be within 
the scope of each. These facts are of educational and clinical significance, 
but at present there are no psychological or statistical means whereby 
one can calculate how many times more or less capable one person is than 
another. 

Caution is necessary at this point. The inferences drawn in the pre- 
ceding paragraph cannot be based solely upon the numerical IQ value 
without reference to the clinical features in the test performances or other 
factors not shown by the numerical index. We have assumed that there 
are no complicating factors and that the IQs are valid measures of the 
capacities and performances of the three boys. The boy with 150 IQ, 
however, might be an unstable personality who is failing in many or all 
of his school subjects. The boy with 100 IQ might have been penalized 
on the test by a language handicap. And the boy with an IQ of 50 might 
show a "scatter" (inconsistency and variation) of performance indicating 
emotional disturbance rather than intellectual impoverishment. Occasion- 
ally, also, it will be found that a high test rating may be attributable to 
an inconsistently high level of performance on one or a few types of sub- 
tests (for example, memory span or word knowledge), just as, conversely, 
it occasionally happens that a person's IQ is depressed by an incon- 
sistently low performance on one or a few subtests. "Inconsistent" means 
that the individual's levels of performance on these few subtests differ 
markedly, in one direction or another, from the general and more con- 
sistent levels of his scores on the other subtests. 

It is to be noted that the possible vitiating factors mentioned in the 
preceding paragraph are of the type to which the experienced and qual- 
ified psychological examiner will be alert. These precautions do not 
signify that all or most intelligence test ratings are affected by these and 
other contingencies. 


Indexes Used with Educational Achievement Tests 


Educational Age. The educational age index (EA) represents 
a pupil's average level of achievement in a group of school subjects, 
measured by means of standardized tests, and in terms of the average for 
[...] educational ages are obtained; therefore they are not all necessarily 
comparable. Furthermore, even if achievement in the same school subjects 
is measured, but with different tests, the EAs might still not be 
comparable because [...] 

Educational age is used, at times, to estimate the probable grade level 
at which a pupil's test performance places him, since the average age at 
each grade is known. This practice, however, is of doubtful merit. If a 
grade estimate or comparison is wanted, then grade norms should be [...] 
educational achievement tests. 

Educational Quotient. As was to be expected, an "age" would 
be accompanied by a quotient. The educational quotient indicates [...] 
knowledge of a group of school subjects is [...] chronological age, or 
whether it is above or below [...] for his age. The simple formula, 
therefore, [...] 


Achievement Quotient. [...] suggested in 1920 (7), is now rarely used.¹⁰ 
It is found by dividing educational age by mental age (EA/MA). [...] 
divisor, instead of CA, is that the [...] index of a pupil's learning 
capacity. Hence, it was believed, dividing EA by MA yields a quotient 
that indicates whether or not the individual is working up to his mental 
capacity. 


10 Although this index is now in disuse, it is mentioned here because 
students will encounter it, from time to time, and because it has a place 
in the history of testing. 


Although it is true that mental age [...] first grade at the age of 7; he 
has a mental age of 10, and, thus, an IQ of 143. To get an AQ of 100, he 
[...] obtain this EA, he must [...] the ordinary child is expected to 
acquire in four. It is not probable that this will happen. To generalize 
this point: the superior child, especially in the lower school grades, 
has not had, and frequently does not have, the time and length of 
schooling necessary to learn the amount of subject matter necessary to 
equate EA with MA. 
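
The arithmetic behind this objection can be sketched briefly. The function below follows the stated definition (EA divided by MA); multiplying by 100, by analogy with the IQ, is implied by the text's reference to "an AQ of 100." The function name is hypothetical, and the assumption that a child just entering school has an EA equal to his life age of 7 is made only for illustration.

```python
def achievement_quotient(educational_age: float, mental_age: float) -> int:
    """AQ: educational age divided by mental age, times 100."""
    if mental_age <= 0:
        raise ValueError("mental age must be positive")
    return round(100 * educational_age / mental_age)


if __name__ == "__main__":
    # Superior child from the example: life age 7, mental age 10.
    # Assumed for illustration: on entering school his EA equals his life age.
    print(achievement_quotient(educational_age=7, mental_age=10))
    # The EA he would need in order to reach an AQ of 100.
    print(achievement_quotient(educational_age=10, mental_age=10))
```

On these figures the child's quotient falls well below 100 even though he has had no opportunity to acquire the additional subject matter, which is the defect described above.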

A second defect in using the AQ is that frequently the population 
samples upon which the educational achievement tests have been stand- 
ardized are not comparable with those upon which the norms of the 
intelligence tests have been based. Generally, the former are less repre- 
sentative of the population and are dependent, of course, upon the qual- 
ity of the schools in which the standardization process was carried out. 

A third defect is the fact that many achievement tests do not differ- 
entiate as well among pupils as does a sound test of general intelligence. 
This fact tends to reduce the variability of the former and its correlation 
with the latter. 

Currently, for the purpose of indicating a pupil's school achievement, 
the EA and EQ have value; but they should be supplemented by each 
individual's percentile rank within the distribution of scores for his 
grade. Since all sound tests cover a range of several grades, it is possible, 
if necessary, to compare any individual's score with the norms in grades 
above or below his own, for the purpose of finding his percentile rank 
within those other levels.¹¹ 


Clinical Aspects 


Scores, whether raw or converted, do not suffice for the com- 
plete interpretation of an individual's performances on psychological 
tests. The several aspects of test standardization thus far presented are 
concerned with the performance of groups of persons and with average 
relationships revealed by statistical treatment of results. It happens, 
however, that although certain types of test items meet some or most of 
the statistical requirements of validity, they are unsatisfactory as indica- 
tors of intelligence when used for clinical purposes. For example, on the 
Stanford-Binet scale, the percentage of adults able to repeat eight digits 
forward (digit-span test) is approximately the same as the percentage who 
can solve one of the more difficult reasoning problems. Yet, in clinical 
examinations, psychologists find some adult mental defectives who can 
pass the former test, although a mental defective can never succeed with 
the latter. What this means is that statistical validation of a test item is 
not always sufficient; it must be supplemented by the pragmatic criterion 
of use with a wide variety of individuals in a variety of situations in 
order to show whether or not it has discriminative value among 
individuals at the several levels of ability. 


11 At present, educational ages and grade norms have less significance than 
formerly as indexes of the quality and level of a child's educational 
achievement. This is the case because nearly all children and adolescents 
now remain in school much longer than they did formerly, and, regardless of 
quality or level of achievement in school subjects, are moved up through 
the grades. This practice, obviously, lowers the normative levels. 

Psychological tests, as already noted, are standardized on the basis of 
the performance of a representative population; and an individual's 
rating is determined by the relationship of his performance to that of 
a group as a whole. Thus we have the several "ages" (for example, mental 
age) and “quotients” (for example, intelligence quotient), percentile and 
decile ranks, and standard scores. Any useful test should yield one or 
more of these. In more recent years, however, without denying the use- 
fulness and value of these indexes of relative status, increasing emphasis 
has been placed upon "patterns" of performance as clinical aids to psy- 
chological diagnosis and counseling. 

A person's responses to tests are now frequently analyzed for the purpose 
[...] abilities or disabilities, [...] responses on some types [...] or 
whether certain psycho[...] markedly superior to others [...] categories of 
maladjusted and ab[...] [...]rning more clearly the mental [...] 

Also, it has been found that persons of equivalent general mental 
status may have different patterns of performance, or abilities, which in 
sum, nevertheless, give them much the same over-all and general ratings 
in terms of a single index (mental age, percentile rank). That is to say, 
it is possible for two persons to have test ratings that are numerically 
similar and yet have dissimilar "mental organizations," since the 
components of each total rating differ to a greater or lesser degree from 
those of the other. 

[...] idiom, in order to discern his particular form of mental organization, 
specific evidences of retardation or disability, if any, and details of his 
development. 

In more recent years there has been a partial shift in emphasis from 
almost exclusive concern with the analysis of abilities and methods of 
psychological measurement, as such, to an examination of individual 
performance and individual idiom, and to the individual as a function- 
ing and dynamic unit. After all, any given test measures only a segment 
of a total personality; that segment is an integral part of the totality and 
is influenced by the whole. Hence, the psychologist who is concerned 
with insight into the nature of an individual's abilities must be able to 
evaluate a person's performance as well as measure it. The data and 
indexes derived from psychological tests are, for the most part, objec- 
tively determined; but their clinical use involves judgment, subjective 
assessment, and interpretation, based upon a variety of data from several 
sources. The experienced clinical examiner will supplement the test's 
numerical results with his observations of the testee's attitudes during 
the examination and the manner in which he attacks the problems of the 
test: his degree of confidence or dependence, his cooperativeness or 
apathy, his negativism or resentment, the richness or paucity of his 
responses. The individual test situation thus can be, in effect, an occasion 
for general psychological observations, really a penetrating psychological 
interview. 

Ability not only to score a test but also to assess and interpret responses 
and to evaluate the individual's behavior during the examination is a 
clinical skill the psychologist develops from working with persons rather 
than with tests alone. However, for the practice of his skill he must, of 
course, thoroughly understand the psychological and statistical 
foundations and hypotheses upon which the tests are based. 

A few specific instances of the qualitative analysis and interpretation 
of test responses will illustrate the kinds of observations that constitute 
the clinical aspects that supplement numerical scoring. 


Word definitions are generally acceptable at a fairly elementary level; 
but they vary in level and quality from purely concrete, to functional, to 
conceptual or abstract. Differences in quality level are indicative of differ- 
ences in modes of thinking. It also happens, at times, that some words are 
emotionally charged for the examinee, in which case his definition and be- 
havioral response may be revealing. 


Some test items permit the exercise of considerable freedom in response. 
These responses may reveal the examinee's attitudes, values, and modes of 
meeting life situations. In this category are test items that ask, "What is the 
thing to do when . . . ?" Or, "Why should we . . . ?" The subject's 
reactions to such items, the qualities of his verbalizations in making the 
responses, and the presence or absence of strong feelings reveal some of the 
nonintellective aspects of his personality. 


The subject's specific comments while performing a task are of possible 
significance in regard to his attitude toward himself, or toward an authority 
figure (the examiner), or toward other individuals and institutions in his 
environment. 


Responses to items or random comments may reveal hostilities and anx- 
ieties, or wholesome cooperativeness and security. 


The manner of speech (the use of expletives, halting and fumbling, restless 
movements, blushing, or, on the other hand, a relaxed attitude, mild 
criticism of one's own performance) provides valuable clues to the testee's 
personality. 


Character disorders may be indicated by impetuous and uncritical 
responses that are incorrect but are given with assurance and pretentiousness. 


The subject's ability to direct his attention toward, to concentrate upon, 
and to organize a task is often revealed by his mode of approach to a test 
problem. 


The selective character, if any, of a person's vocabulary and information 
(two subtests widely used) will shed light upon his experiences, interests, 
and cultural background. 


A personality trait such as compulsiveness (as opposed to desirable 
thoroughness and self-criticism) may be revealed by excessively detailed 
responses and by numerous and unnecessary alternative responses. 


Some types of responses indicate pathological or psychotic states: 
erroneously bizarre responses by an otherwise intelligent person (for example, 
London is in Africa; the population of the United States is 1,500,000); 
disjointed and irrelevant responses; and distorted interpretations of the task 
or problem. 


Organic damage may be detected through selected kinds of subtests; for 
example, disturbance of the visual-motor function as indicated by the 
diamond copying test (Stanford-Binet) and the object assembly test (Bellevue), 
among others. 


Scatter analysis (discussed in detail in Chapter 14) is essential to the 
discernment of superior, inferior, and impaired psychological functions. 


Sensitized observations on the part of the examiner will enable him, in 
general, to evaluate how the subject proceeded in both success and failure.¹² 


12 Illustrations of these qualitative interpretations will be given in 
Chapter 14, on clinical aspects. 


The findings on a test, whether it is of general intelligence, specific 
aptitude, personality, or school learning, indicate the present status of 
each person examined. They do not, however, tell the psychological 
examiner by what course the person arrived there; nor do they indicate 
specifically what factors were operative in his development. The clinical 
approach, while accepting and utilizing standardized tests and norms, 
insists upon viewing and evaluating any given individual's performance 
and status in the light of a variety of other measurements, observations, 
and activities of that individual, and upon interpreting the objective 
quantitative data according to the part they have in the total. For in- 
stance, children who have suffered from prolonged and serious nutritional 
disturbances or deficiencies or who are suffering from severe anemia will 
appear listless, apathetic, and deficient in mental capacity. Other chil- 
dren, of apparently retarded mental development, may be suffering from 
serious deficiencies of vitamin B complex, while still others may measure 
at a level of retardation because of emotional pressures and “blocking.”¹³ 
Furthermore, the performance of some children on standardized tests is 
inferior because they developed under conditions of psychological im- 
poverishment, whereas test norms are based on the assumption that all 
individuals being examined have had approximately equal opportunity, 
in the grosser aspects of environment, for mental development. Often, 
of course, that is not the case. These facts, and others of the same kind, 
indicate that in the case of some individuals, performance and conse- 
quent relative status may be impaired by nutritional deficiencies, by 
emotional handicaps or by other unfavorable environmental conditions. 

The fact that psychologists are placing increased emphasis upon in- 
dividual patterns in test performance (especially in diagnosing cases of 
behavior and educational maladjustment) and upon the individual as a 
whole does not mean that statistical and group studies are unnecessary. 
Such studies are essential in providing norms against which any individ- 
ual's performance may be projected, in giving more precise meaning and 
significance to any single score, in demonstrating the great range of hu- 
man variability in any trait or function, and in providing the means of 
more precise study of interrelations among psychological traits and func- 
tions. 


13 A condition in which the functioning of mental abilities is impeded 
because of the individual's emotional state or a mental conflict. 


Difference between Norms and Standards 


Norms, as already explained, are average scores or values 
determined by actual measurement of a group of persons who are 
representative of a specified population; for example, all 12-year-old boys, all 
fourth-grade children, all native-born male adults. Norms, therefore, are 
averages obtained under prevailing conditions: good, poor, or indifferent. 
These norms may well ". . . reflect all the sins of omission or 
commission in their [the people's] nurture and must be critically examined 
lest we set up as desirable norms for achievement what are but accidental 
outcomes of our unsystematic and unenlightened nurture of 
children . . ." (6, p. 9). In other words, a norm of psychological 
performance or of a physical trait is not necessarily one with which we should 
be satisfied; for it reflects development under conditions that may be, 
and often are, much less than optimal. As an example, consider the 
average vocabulary of eighth-grade pupils. The average, or norm, of this 
group will depend in part upon these pupils' opportunities for the 
acquisition and use of language from earliest childhood. Their 
opportunities might have been extremely poor, moderately satisfactory, very 
good, or at any other level between these three. Norms of performance in 
respect to some of the psychological processes measured by means of 
tests of intelligence and of specific aptitude are likewise dependent upon 
conditions and opportunities [...] 

It is necessary, [...] hand, and standards [...] objective, which [...] 
norms; that universal op[...] nutrition would raise age norms for height 
and weight, and so forth. 

Psychological tests measure traits and functions as they exist under 
present conditions [...] might rai[...] the minds of investigators, [...] 

[...] comparable with or similar to the values obtained in the measurement of 
physical traits or phenomena, as, for example, length, weight, or light 
intensity. In the physical realm, the units of measurement are fixed and 
constant throughout the entire scale. An inch, a pound, a candle power— 
each has the same value and physical significance at whatever place on 
the scale it is measured. Psychological measurement, by contrast, is more 
difficult and is confronted by special problems. In the first place, it is not 
possible to determine the inherent difficulty of an item in a psychological 
test in terms of constant units, as it is possible to find the length or weight 
of any object. Whereas in the measurement of physical phenomena it 
can be found that a given object is of X length and Y weight, and hence, 
let us say, twice as long and three times as heavy as another object with 
which it is being compared, no such direct measurement and comparison 
are possible in psychological testing. In this realm, the measurement value 
of a test item is dependent basically upon the percentage of persons able 
to pass the item in the population group for whom the test is intended. 

If in the testing of a particular ability, one item is passed by only ten 
percent of a group, whereas another item is passed by fifty percent, it can- 
not be said that the first is five times as difficult as the second, because 
"percent passing" is not a unit in the sense that an inch or a pound is. 
What can be said is that an individual able to deal successfully with the 
first item belongs in the highest decile group, while one who cannot pass 
the first but is able to pass the second falls at the midpoint or average 
level of the group in respect to that item of the test. This interpretation is 
significant psychologically and educationally. Or, to use another instance, 
in the case of the Stanford-Binet scale, the age level at which an item is 
placed (and hence its value in the scale) is determined by the age group 
in which average individuals pass that item. Thus, anyone able to solve 
a reasoning problem placed at the 10-year level, for example, but failing 
to pass reasoning problems at the 11-year or higher levels, may be said 
to have typical 10-year-old ability in respect to that mental task. 
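
The dependence of an item's measurement value on the percentage passing can be illustrated with a small sketch. The response records and the function name below are hypothetical, chosen only for illustration; they are not taken from any actual test.

```python
def percent_passing(responses):
    """Return, for each item, the percentage of examinees who passed it.

    `responses` is a list of examinee records; each record is a list of
    0/1 values (1 = passed), one per item. The data are hypothetical.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        100.0 * sum(person[i] for person in responses) / n_examinees
        for i in range(n_items)
    ]


# Ten hypothetical examinees, two items: item 0 is passed by one
# examinee (10 percent), item 1 by five examinees (50 percent).
responses = [[1, 1]] + [[0, 1]] * 4 + [[0, 0]] * 5

p = percent_passing(responses)
print(p)  # [10.0, 50.0]
```

The two percentages order the items by difficulty and locate an examinee within the group, but their ratio is not a statement that one item is five times as hard as the other.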

This leads directly into the problem of the meaning of "mental age" 
and other "age" units. Mental age is an index showing one's level of 
mental development, corresponding to the level of mental development 
of average persons of the coinciding chronological age. Thus, if a child's 
mental age is 10, he has reached the level of mental development attained 
by the average 10-year-old child, regardless of the actual life age of the 
child being tested. 

Now, suppose there are four individuals having mental ages, respectively, 
of 5, 6, 12, and 13. Is the difference between mental ages 5 and 6 
the same as that between mental ages 12 and 13? It is not; for, as measured 
by mental tests, the rate of mental development at 5 and 6 years 
of age is more rapid than subsequently; hence, the increment between 
the earlier years is greater. Figure 6.5 is the curve accepted by most 
psychologists as representing rate of mental development. The outstand- 
ing feature of that curve, for the problem under consideration, is the 
fact that it rises at a decreasing rate with increasing age. The curve is 
"negatively accelerated." Or, psychologically speaking, with each succeed- 
ing year, until maximum development is attained, the amount of incre- 
ment in mental growth is less than in the preceding year. 


Fig. 6.5. Hypothetical curve of mental growth, illustrating decreasing 
yearly increments. (Vertical axis: mental age; horizontal axis: 
chronological age, 0 to 16.) 


It is obvious, therefore, that each successive year of mental age added 
to an individual's level represents something less in measurable growth 
than the preceding year's increment. In other words, mental age units 
are not uniform; they rank an individual with respect to the average 
mental capacity of an age group. The same principle—nonuniformity of 
age units—applies also to all other types of psychological tests that trans- 
late their scores into age equivalents. 
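
The nonuniformity of age units can be made concrete with a short numerical sketch. The growth function below (an exponential approach to a ceiling) and its parameters are assumptions chosen only to produce a negatively accelerated curve; they are not the equation behind Figure 6.5.

```python
import math


def mental_growth(age, ceiling=100.0, rate=0.15):
    """Hypothetical negatively accelerated growth curve.

    Returns an arbitrary "amount of development" at a given age. The
    functional form, `ceiling`, and `rate` are illustrative assumptions.
    """
    return ceiling * (1.0 - math.exp(-rate * age))


def yearly_increment(age):
    """Growth added between `age` and `age + 1` on the hypothetical curve."""
    return mental_growth(age + 1) - mental_growth(age)


early = yearly_increment(5)    # growth from age 5 to age 6
late = yearly_increment(12)    # growth from age 12 to age 13
print(round(early, 2), round(late, 2))
```

On any curve of this general shape the increment from 5 to 6 exceeds the increment from 12 to 13, although each is labeled "one year" of mental age.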

Although psychological testing would be easier and more precise if its 
measuring units were fixed and uniform, available indexes of relative 
rank are, nevertheless, essential in evaluating an individual's mental 
development, his educational progress, his particular aptitudes, his social 
maturity, and even certain nonintellective aspects of personality. Experi- 
mentally, too, these same indexes have been indispensable in studying a 
host of practical and theoretical problems, such as sex differences, effects 
of environmental conditions, inheritance of intellectual capacity and of 
special aptitudes, occupational differentiation, racial differences, relation- 
ships between physical and mental development, and others. 
Essential Considerations in Selecting a Test 


By way of summary, the following factors are given as those to 
be considered in selecting a psychological test. 

Norms. The test must provide appropriate and accurate norms, 
whether they be in the form of age, grade, percentile rank, standard 
score, or any other type. Norms should be meaningful with regard to 
the purposes for which the test is intended and to the groups of persons 
with whom it is to be used. 

ADMINISTERING AND SCORING. The procedures of administering the 
test should be objective and the test items should be amenable to rela- 
tively objective and simple scoring, insofar as the nature of the instru- 
ment permits. Individually administered tests, like the Stanford-Binet, at 
times require insightful judgment in scoring responses. Interpretation 
and evaluation of responses are even more significant in the scoring and 
analysis of projective techniques for assessing personality. 

Time Requirements. The length of the test should not be so great 
as to produce boredom, satiation, or negativism; for when these set in, 
the subject does not perform at his best level. Specific time limits cannot 
be prescribed for all tests or for all types of tests; but in general, shorter 
time requirements are indicated for younger children and for the men- 
tally retarded. In the case of both of these groups of subjects, the atten- 
tion span is relatively brief; hence, it may be necessary at times to com- 
plete an examination in two sessions. 

INTEREST LEVEL. Test items should be of sufficient interest to motivate 
the individuals for whom they are intended. Particular items and types 
of problems devised to measure given functions must be suitable to the 
age levels of examinees. Thus, in constructing an intelligence test for 
the entire range of adult capacity, from very low to very high, it is neces- 
sary that even the items placed at the very low levels be of the sort that 
will interest an adult rather than a child, even though these low-level 
adults may be inferior in mental capacity to some children. 

THE POPULATION SAMPLE. The manual of a test should state in detail 
the nature of the population sample on which the instrument was stand- 
ardized and upon which norms are based. The information given should 
include the following: total number of cases, age range and number at 
each age level, number of each sex, geographic distribution, socioeconomic 
status, and number in each category. For some tests, it will be relevant, 
if not necessary, to have information on some of the following: school- 
grade distribution, number of years of schooling completed, amount and 
kind of special training (especially for tests of specific aptitudes), special 
or abnormal adjustment problems and history (especially for tests of 
personality). In short, the prospective user of a test must be certain that 
the test has been standardized on an appropriate sample of the population 
and for the same or similar purposes as those contemplated by the pros- 
pective user. This principle seems axiomatic; yet it is not always given 
due consideration. 

THE FUNCTIONS OR TRAITS MEASURED. The test manual should not 
only state the purpose of the instrument; it should also provide, so far as 
possible, an analysis (psychological and statistical) of the functions or 
traits being measured. 

RELIABILITIES. Coefficients of reliability should be provided not only 
for total scores but for part scores as well, wherever possible. Also, re- 
liability coefficients are desirable for each of the several age levels and 
ability levels included within the range of the test. Furthermore, the 
manual should state which method or methods have been used in cal- 
culating the test's reliability. Here, the prospective user of a test must 
look for information that will help him answer the question: "Reliable 
for whom and for what purposes?" 

VALIDITY. Data on validity are of several kinds, in addition to co- 
efficients of correlation; for example, expectancy tables, known groups, 
significance of differences between age levels. The test manual should 
explain the characteristics of the criterion groups, the nature of other 
criteria used, the validity of the total test, and the validity of the sub- 
tests. It is desirable, also, to have data regarding validity at each of the 
several age and ability levels. Here again, an answer must be sought to 
the question: “Valid for whom and for what purposes?" 

Reports OF EXPERIMENTS. The ideal test manual (subsequent to the 
first or earliest editions) includes summaries, findings, and interpretations 
of the most important experimental studies to which the test has been 
subjected by psychologists. Such information will help users to understand 
more fully the nature of the test and the factors affecting performance 
on it, thus making for sounder interpretation of results obtained by those 
who use it. For example: What is the influence of cultural factors? Of 
practice? Of time limits? Of psychotherapy or counseling? 


Psychological tests are scientifically constructed instruments based upon 
psychological and statistical principles. Familiarity with these principles 
will provide students with a sounder comprehension of both the values 
and limitations of tests than they would derive from using and inter- 
preting them in a mechanical manner. It is also true that human subjects 
are being tested; they do not behave like mechanisms under complete 
control, with all environmental forces likewise under control and meas- 
urable. On the contrary, human behavior is often subtle and the psycho- 
logical forces motivating or influencing persons in a test situation may 
be elusive and difficult of evaluation. Furthermore, since the quantitative 
data of psychological testing are not as definite, precise, and uniform as 
are the data of physical measurements, the interpretation of test findings 
is more difficult. For these reasons, we have emphasized not only the 
well-defined scientific principles and procedures of testing, but also the 
qualitative and clinical aspects that are essential if test findings are to be 
of the greatest value to the individuals examined. 
DEFINITIONS AND ANALYSES 
OF INTELLIGENCE 


Psychological testing began, it will be recalled, with efforts to 
devise scientific instruments for the measurement and study of individual 
differences in intelligence. Measurement and analysis of this complex 
mental process has continued to be the most important and widespread 
type of psychological testing. It is desirable, therefore, to examine the 
definitions and theories of intelligence, both for their historical value 
and their current significance in test construction and utilization. Knowl- 
edge of these will give the student a fuller understanding of current tests. 


Definitions 


Three Types. A variety of definitions have been given by 
psychologists; but as a matter of fact, each can be classified into one of 
three groups. 

One group of definitions places the emphasis upon adjustment or 
adaptation of the individual to his total environment, or to limited 
aspects of it. According to definitions of this type, intelligence is general 
mental adaptability to new problems and new situations of life; or, 
otherwise stated, it is the capacity to reorganize one’s behavior patterns 
so as to act more effectively and more appropriately in novel situations. 
Thus, the more intelligent person is one who can more easily and more 
extensively vary his behavior as changing conditions demand; he has 
numerous possible responses and is capable of greater creative reorganiza- 
tion of behavior, whereas the less intelligent person has fewer responses 
and is less creative. The more intelligent person, accordingly, can deal 
with a greater number and a greater variety of situations than the less 
intelligent; he is able to encompass a wider field and to expand his area 
of activity beyond that of the less intelligent. 

A second type of definition states that intelligence is the ability to 
learn. According to this definition, then, a person's intelligence is a 
matter of the extent to which he is educable, in the broadest sense. The 
more intelligent the individual is, the more readily and extensively is 
he able to learn; hence, also, the greater is his possible range of experi- 
ence and activity. 

Still others have defined intelligence as the ability to carry on abstract 
thinking. This means the effective use of concepts and symbols in dealing 
with situations, especially those presenting a problem to be solved through 
the use of verbal and numerical symbols. Binet's conception of intelli- 
gence belongs largely in this category, for he maintained that it is the 
capacity to reason well, to judge well, and to be self-critical. 

It should be apparent that the three foregoing categories of definitions 
are not, and cannot be, mutually exclusive. For the most part, their 
authors differ in emphasis. Obviously, ability to learn must provide the 
foundation for adjustment and adaptation to changing or new condi- 
tions. And a person may be expected to have learned more or less from 
situations he had encountered and to which he had previously made 
adjustments. For if this were not the case, he would have to start anew 
in every situation which confronted him; there would be no difference 
between the behavior of an experienced person and that of a novice. 
There are, of course, individual differences in respect to learning 
capacity and in ability to retain, interpret, organize, and apply what has 
been learned; thus previous experiences will have different significance 
and different learning value for different persons. And it is learning 
capacity that constitutes the b [...] though, as will become apparent [...] 
[...]lective factors affect adjustment [...] 

Yet learning capacity, in the sense only of acquisition of information 
and knowledge, is not a s [...] situations and a definition of intelligence 
as the ability to learn represent, in fact, two aspects of the same process. 

The third type of definition is also inseparable from the other two. 
A person learns abstractions—principally verbal and numerical—through 
experience, through contact with and perception of the objects, events, 
qualities, or relationships for which the symbols stand. Thus, the word 
"dog" has meaning for a child because it has come to represent a class of 
objects with which he has become familiar. The word "green" represents 
a quality he has perceived as an aspect of a variety of objects. The word 
"charity," for the individual who has developed sufficiently to under- 
stand the concept, has a certain connotation because he has experienced 
events that have been labeled as charitable. The number “five” is mean- 
ingful to a person when, as a result of experience with concrete objects, 
he apprehends the term as representing not only ordinal position but 
summation as well. Furthermore, if it is to be said that an individual 
has fully learned to deal with the symbols of abstraction, then it must 
be true that he understands that the word is not the thing or the quality 
for which it stands. He understands that words and numbers are abstrac- 
tions that represent objects, events, qualities, or relations, but which, in 
thinking, can be dealt with as if they were the things themselves. This 
aspect of intelligence—the ability to use symbols—is itself the result of an 
individual's development and learning. And in turn, the mastery and 
utilization of symbols promotes further learning—for it is hardly neces- 
sary to labor the point that without language and number, the range of 
one’s learning would be seriously restricted. 

Ability to carry on abstract thinking, it is easy to see, contributes to a 
person's ability to adjust or adapt to changing or new situations, because 
through the use of symbols we are enabled to think through a problem 
without spending time and effort on sheer trial and error in action; we 
can marshal, evaluate, and deal with past experiences; and we can project 
our thinking forward. In other words, through the use of symbols and 
abstract thinking, man is able to enlarge his range of behavior, to extend 
his horizons, and to transcend the immediate concrete and specific situa- 
tion. 

Two Comprehensive Definitions. Two definitions of intelli- 
gence have been presented which, in effect, combine and extend the 
three views already presented. One writer states: “Intelligence is the 
aggregate or global capacity of the individual to act purposefully, to 
think rationally and to deal effectively with his environment” (19, p. 9). 
The reader can readily compare this definition with those already pre- 
sented and analyze it with a view to discerning similarities and differ- 
ences. It will be noted, of course, that this definition encompasses the 
other three. Although learning ability is not mentioned, it is surely im- 
plied. Two new aspects, however, are added. The definition specifically 
states that an individual's intelligence is revealed by his behavior as a 
whole ("global"), and that intelligence involves behavior toward a goal, 
which may be more or less immediate (“purposefully”). A third aspect is 
presented by the author in his elaboration of the definition; namely that 
"drive" and "incentive" enter into intelligent behavior. This aspect is 
probably included and implied in capacity “to act purposefully” and “to 
deal effectively" with one's environment, as stated in the definition. 

The inclusion of “drive,” “incentive,” and the like as aspects of intelli- 
gence is of doubtful validity; their inclusion would confuse the issue, the 
testing instrument, and the results obtained. It is true, of course, that 
effective utilization of a person's intelligence depends upon the extent 
and degree to which he employs it. Nevertheless, a single testing device 
that attempts to combine the measurement of intellectual with non- 
intellectual traits without providing for differentiation between the two 
would not succeed adequately in either respect.1 This is not to say that 
in assessing an individual's intelligence and personality as a whole we 
should ignore "drive," "incentive," "interest"; for the competent psycho- 
logical examiner does evaluate these and other nonintellectual traits in 
presenting his test results. Furthermore, as will be seen in later chapters 
of this book, special psychological instruments are available for the eval- 
uation of nonintellectual traits of personality, which the clinician may 
use to supplement results of intelligence tests if he believes they are 
necessary. 


Stoddard offers the following definition: "Intelligence is the ability to 
undertake activities that are characterized by (1) difficulty, (2) complexity, 
(3) abstractness, (4) economy, (5) adaptiveness to a goal, (6) social value, 
and (7) the emergence of originals, and to maintain such activities under 
conditions that demand a concentration of energy and a resistance to 
emotional forces" (8, p. 4). Here again, the reader will note that this 
definition does in fact include the first three types presented; but it goes 
beyond these in several respects. The author specifies the several attributes 
of intelligence, and in his enumeration are several not included in earlier 
definitions. 


Degree or level of difficulty is implied in all definitions; but Stoddard's 
contribution here lies in the fact that he rightly insists we must, in 
testing, distinguish between true differences in degree of difficulty and 
differences that only seem to exist, as between two or more test items, 
whereas, in fact, there are no inherent differences in difficulty. For ex- 
ample, the accumulation of rare information and the ability to define 
unusual words are not in themselves true measures of difficulty; they 
may reflect only differences in experience. On the other hand, however, 
over and above disparity in experiences between various age groups, true 
differences in difficulty do exist between problems that can be solved, 
let us say, by a group of average 10-year-old children, and those that can 
be solved by an average group of 8-year-olds or 9-year-olds. 

1 [...] possibility of drawing [...] an individual's performance on a scale such 
as the Stanford-Binet and the Wechsler regarding some of his nonintel- 
lectual characteristics. 

"Complexity" refers to the number of different kinds and varieties of 
tasks that can be dealt with successfully. According to this attribute of 
intelligence, the individual who is able to deal successfully with several 
different kinds of tasks, at a given level of difficulty, is more intelligent 
than a person who can successfully undertake fewer kinds of tasks at the 
same level of difficulty. "Complexity," however, means not simply the 
addition of one type of performance to others; on the contrary, it means 
the capacity to assimilate new abilities, to integrate them with others, and 
thus to reorganize one's patterns or forms of intelligent behavior. 

"Abstractness"—that is, operating with symbols, especially at levels of 
analysis and interpretation—has already been discussed. For Stoddard, 
this attribute "lies at the heart of intelligence as defined." 

"Economy" refers to the rate at which mental tasks are performed and 
problems solved. Assuming that the problems are solved equally well, that 
the solutions are equally effective, the individual working more rapidly 
would be regarded as the more able, according to this attribute. Ac- 
ceptance of “economy” as an attribute of intelligence means that tests 
would impose time limits that should differentiate among individuals in 
respect to their rates of performance of tasks and solutions of problems 
at given levels of difficulty and degrees of complexity. 

"Adaptiveness to a goal" implies an approach that is more than aim- 
lessly meeting and solving new situations as they arise. This attribute 
means that intelligent action is directed toward a goal or a purpose. The 
more comprehensive the goal and the larger and more complete the 
purpose, the more is intelligent action required. 

The student, after examining representative tests of intelligence, might 
well question whether they do, or even could, satisfactorily test this last 
attribute; or whether the problems and tasks included in the tests are 
rather oversimplified and segmental examples of problems and courses of 
action that a person has to confront and deal with in actual life situations. 
If the test items are of the latter kind, then their value and validity as 
measures of intelligence must be shown by the fact that they do indeed 
predict to an adequate degree the manner and effectiveness with which 
the testee will deal with and solve actual life situations of broader scope. 
In other words, what are the predictive values of the items, tasks, and 
problems included in a test? 

The inclusion of "social value" as an attribute of intelligence is of 
doubtful validity, and debatable at best; for this criterion is essentially 
moral or ethical, or a matter of subjective evaluation. The basis of "social 
value" is group acceptability. If this attribute were applied in evaluating 
intelligence, we should have to minimize our estimates of the intelligence 
of individuals whose thinking and solutions of problems are not neces- 
sarily consistent with accepted social forms, though they might be "ahead 
of their time"; and of individuals who are capable of difficult and com- 
plex mental operations, but whose mental activities lead to no apparent 
or demonstrable practical and social values. While we may value more 
highly the individual whose mental operations culminate in desirable, 
acceptable, and useful social outcomes, the inclusion of "social value" as 
an attribute of intelligence would confuse attempts to measure the other, 
and valid, attributes by injecting largely subjective conceptions of what 
is socially acceptable, unacceptable, or indifferent. It will be seen later 
that "social value" is hardly present in current tests of intelligence; al- 
though, of course, some psychologists, like Stoddard, take the position 
that it should be. 
Stoddard's last two conditions of intelligent behavior, "concentration 
of energy" and "resistance to emotional forces," are subject to the same 
criticism as Wechsler's inclusion of "drive" and "incentive." Motivation 
and ability to exert sustained effort are usually regarded as nonintellec- 
tual aspects of activity and are certainly recognized as playing highly 
important roles in one's general effectiveness. But to introduce them 
into a test of mental ability would be to confuse and probably to in- 
validate efforts to arrive at a reasonably valid measure of the level of 
intelligent activity at which a given person is able to operate, whether 
or not he actually does operate at that level in all situations. 

Although tests of intelligence do not directly measure motivation and 
concentration of energy on the solution of problems, the psychological 
examiner does in fact try to develop or encourage conditions wherein the 
persons being examined will operate at their maximum levels of ability. 
This can be more nearly achieved in individual testing than in group 
testing. Furthermore, if an individual is not adequately motivated, is not 
expending a maximum of energy during the test, or is handicapped by 
emotional factors, these conditions can be discerned much more readily 
during the examination of one person at a time than during the examina- 
tion of a group all at once. Indeed, during group testing there may be 
instances of persons whose test results are vitiated by the effects of these 
unfavorable conditions, unknown to the examiner. This possibility is a 
disadvantage of group testing. 

No single test or series (battery) of tests can provide an unfailing index 
or a guarantee of motivation, energy output, or freedom from emotional 
blocking in all future situations requiring intelligent behavior. For man 
is not a static being; nor does the environment in which he lives remain 
static. Long-term motives and immediate incentives will change; values 
and interests will change. The affective (emotional) quality of a person's 
experiences will influence his subsequent behavior, including situations 
requiring the utilization of intelligence. Tests of intelligence now in use 
are not intended to determine the extent to which an individual will in 
the future concentrate his energy on problems demanding the use of his 
intelligence, or to determine whether it is probable that he will be able 
to remain free from emotional blockings. A variety of personality rating 
scales and inventories and projective techniques have been devised to 
evaluate these nonintellectual traits. Although tests of intelligence will 
be improved so that greater demands will be made upon concentration 
of attention and sustained effort than is the case with some tests at present, 
psychologists believe that a qualified examiner will be able to determine 
whether or not a given person's performance on an individual test repre- 
sents his maximum level at that time. They believe, also, that most 
persons can be motivated to perform at their best levels when taking a 
group test. When groups are tested, this condition must be reasonably 
well assured as well as being assumed.3 

2 [...] score very high on these tests without having exceptional powers [...]. We need 
tests of originality, but in view of the very nature of the concept and its expressions, such 
tests cannot very well be standardized. 

Although intelligence tests are not designed to measure a person's emo- 
tional status and other nonintellectual aspects of personality, clinically 
oriented psychologists analyze an individual's test performance for evi- 
dence of emotional states, personality "mechanisms" and for "differential 
diagnosis" (that is, for evidence of neurosis, psychosis, or other atypical 
states). This aspect of test interpretation, which demands sensitive clinical 
insights, is discussed in Chapter 14. 


Implications for Test Design and Content 


Definitions of intelligence are of more than theoretical importance [...] 
intelligence that a psychologist holds will affect, [...] content and 
organization of the test he de[...] an examination of a representative group 
[...] though some are different from others in [...] much in common, 
nevertheless. It would be [...] that certain tests exemplify exclusively the 
definition that intelligence is the capacity to learn. Because psychologists 
emerge with tests having considerable similarity, although they may start 
with different definitions, it follows that their definitions differ largely 
in respect to emphases and that, as already pointed out, they are inter[...] 
[...] and thus to produce more useful and successful instruments regardless 
of the exact definition with which each one started out, found that in- 
evitably their testing instruments were broader than their definitions. 
Current tests, as a result, have more than a little in common in spite 
of differences in details of their content. Inspection of their content will 
show that in varying degrees they are measures of some aspects of learning 
in what is assumed to be a reasonably uniform environment for all 
persons,4 that novel situations and problems are presented, and that abil- 
ity to carry on abstract thinking is tested through the utilization of 
symbols and ideas. Inspection will demonstrate also that most of the 
tests fail to meet the more comprehensive and long-term attributes sug- 
gested in Stoddard's definition. 

So far as available tests are concerned, it is important to bear in mind 
that the French psychologist Binet (the “father” of modern mental test- 
ing) took the position that it made little difference what specific tasks 
and items were incorporated into a test, provided that in some degree 
each part was a measure of the individual's general capacity. Whether or 
not this condition is met will depend, of course, upon the definition of 
intelligence regarded as most adequate by the designer of a test and upon 
the criteria of intelligent activity against which test results are validated. 
In spite of some differences in definition and in spite of some differences 
in external appearances, psychologists believe that their tests are reason- 
ably sound because they are related to, and have value in, predicting 
the likelihood of intelligent activity in life situations. 


Three “Kinds” of Intelligence 


Some psychologists believe that several kinds of intelligence 
should be distinguished from one another. Noteworthy among them is 
E. L. Thorndike, who has divided intelligent activity into three types: 
(1) social intelligence, or ability to understand and deal with persons; 
(2) concrete intelligence, or ability to understand and deal with things, 
as in skilled trades and scientific appliances; (3) abstract intelligence, or 
ability to understand and deal with verbal and mathematical symbols. 

The merit of this classification of types of intelligent activity, for 
psychological testing and diagnosis, is that it indicates several realms in 
which persons might be functioning and implies that separate and suffi- 
ciently specialized tests might be devised to measure how effectively 
persons are functioning in each. 

4 This raises the much-debated problem of heredity and environment as factors in the 
development of intelligence. 

Although it is true that any given person's scores on a test using verbal 
and numerical abstractions might differ appreciably from those attained 
by him on a test of social relationships and insights, or on one of "con- 
crete" intelligence, it is also true that, when a representative group of 
individuals is tested, the correlations between the types of tests are found 
to be positive and significant, both statistically and psychologically. For 
example, correlations between tests of verbal and of concrete intelligence 
vary from about .25 to about .45, the average being about .30 to .35. 
Although this is a somewhat low average correlation, it still indicates that 
some communality of function is being measured. This index, being so 
far from unity, also means that there are numerous individuals whose 
relative scores do not correspond closely or whose two relative scores 
may be discrepant. This fact points up the important psychological prin- 
ciple that the data and status of any single person may be inconsistent 
with the general trend. Study of the individual, and the ways in which 
and the reasons why he deviates from or exemplifies general trends, is 
one concern of the clinical psychologist. 
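
To make the size of such coefficients concrete, the short sketch below squares each correlation to obtain the proportion of variance that the two kinds of tests share. This relies on the conventional coefficient of determination, which the text itself does not invoke, and the function name `shared_variance` is a hypothetical one chosen for illustration.

```python
def shared_variance(r):
    """Proportion of variance two measures have in common, given correlation r."""
    return r ** 2


# Correlations spanning the range reported for verbal and concrete tests.
for r in (0.25, 0.30, 0.35, 0.45):
    share = shared_variance(r)
    print(f"r = {r:.2f}: shared = {share:.3f}, not shared = {1 - share:.3f}")
```

At a correlation of .30 the two tests share 9 percent of their variance, which leaves ample room for the individual discrepancies described above.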

Of the three kinds of abilities enumerated above, abstract intelligence 
is the one that receives greatest weight and is most pronounced in cur- 
rent tests of intelligence—that is, whenever the test is designed for use 
with persons who are presumed to have reached a level where they may 
be expected to have developed facility in dealing with concepts and 
symbols. 


Even tests that present the subject with "things" rather than with ideas 
and symbols are not devoid of demands upon ability to conceptualize 
and make abstractions, although testees need not necessarily state these 
in the form of language and number. For example, when a subject is 
required to arrange a series of pictures into a sequential and meaningful 
whole, he must at some stage form a concept of "the whole" if his re- 
sponse is to be correct by some means other than pure chance. He must 
do this, also, in assembling parts into an integrated unit (called “object 
assembly”). The same is true of the child who is asked which is the 
“prettiest” of two pictures (“esthetic comparison”), for he must have a 
concept of "prettiness," however unarticulated it may be. There are in 
use many other types of test items that deal with things but still require 
more or less ability in concept formation. Among these are object classification, 
tracing the shorter of two routes in a maze, identifying objects 
by use, and supplying missing parts in the drawing of a human figure. 
In short, the fact that some types of test items do not employ language 
or number does not necessarily signify that they make no demands upon 
ability to reason at a level of concept formation and abstraction. 

At the earliest developmental levels there are tasks that depend upon 
visual-motor skill, such as tying a bow knot, grasping a ring, holding a 
pencil and scribbling, and manipulating cubes. These types of tests, however, 
are restricted principally to the first eighteen months of life. They 
are useful as developmental indicators, but they have only slight pre- 
dictive value for later development of mental abilities, as measured by 
tests at more advanced levels.⁵ 

The role of ability to deal with ideas and symbols (words and num- 
bers) as a measure of concept formation and abstraction is of increasing 
importance in tests of general ability (intelligence) as age level increases. 
Proportions of verbal and numerical tests, on the one hand, and non- 
verbal, nonnumerical, on the other, undergo change at different age 
levels;⁶ some tests include a larger proportion of the latter than do 
others, even at the adolescent and adult levels.⁷ These differences are not 
haphazard, nor are they matters of individual whim; they depend upon 
the purposes of the test and the test author's conception of intelligence 
and its constituent parts. It will be seen in later chapters that the correla- 
tions between various tests of general ability are quite marked—and at 
times high or very high—thus indicating that to an appreciable degree in 
these tests the verbal and numerical items on the one hand, and the non- 
verbal, nonnumerical on the other, are measuring the same or closely 
related functions.⁸ High intercorrelations do not always mean that the 
same functions are being measured by the tests concerned; such correla- 
tions may reflect other common factors that affect the tests being corre- 
lated. This, however, is quite improbable as an explanation of test 
intercorrelations. 


Analyses of Mental Ability 


The definitions of intelligence thus far discussed are functional 
in character; that is, they state how intelligence operates: through learn- 
ing, adaptation, abstract thinking. But, in addition, psychologists have 
been concerned to know the "structure" of intelligence. They have made 
analyses in an effort to determine its underlying factors. Or, otherwise 
stated, the purpose of these analyses has been to discover, if possible, the 
elements, or components, of intelligence, not only for a better theoretical 
understanding of this complex process, but also to learn what might be 
the implications for the design and construction of mental tests. 

5 See Chapter 13, "Scales for Infants and Preschool Children," in which this problem 
is discussed at length. 
6 See the revised Stanford-Binet scale below the age-5 level; also the Merrill-Palmer 
scale, the Minnesota Preschool Test, the Detroit Kindergarten Test. 
7 See the Otis tests, the Kuhlmann-Anderson tests, the Wechsler-Bellevue Intelligence 
tests, the Pintner-Paterson scale, and the Lorge-Thorndike tests. 
8 The statements in this paragraph do not mean that the nonverbal materials in the 
tests are measures of mechanical ability. Generally, they are believed by the authors 
of the tests to measure the same psychological processes as do the verbal materials, but 
by means of different content. 

It is not to be inferred, however, that the dynamics of intelligent activ- 
ity can be adequately understood merely by enumerating and character- 
izing the components, whatever they might be. Whatever the components, 
they do not operate independently or in isolation. Understanding the 
dynamic aspects of mental activity requires some means of characterizing 
the organization of factors, their interrelationships, and their relation to 
motivational forces. 

Essentially, the experimental method followed is this: a rather large 
number of separate tests, more or less diverse in character, are given to 
an adequate sampling of the population. The results of each type of test 
are correlated with those of all the others. The coefficients of correlation 
are then subjected to various techniques of statistical analysis in an effort 
to determine the extent of common ground between them (technically [...]) 
and their degree of independence. These statistical techniques are known 
as factor analysis.⁹ The particular theory or hypothesis deduced from the 
statistical operations will depend upon the interpretation of the analysis; 
and the experts differ in their interpretations. These differences, however, 
[...] 

9 Discussed later in this chapter. 

[...] indicates, intelligence [...] factors, or elements, [...] If two types 
of mental activities, [...] than are A and C, the reason, according [...] 
that the first pair has more elements in common than does the second pair. 
According to this theory, [...] no such factor as "general intelligence"; 
there are [...] number of such depending upon how [...] make and are 
capable of making. [...] an “atomistic” theory of mental ability. He adds, 
however, that certain mental activities have so many of their elements 
in common that it is useful to classify these tasks into separate groups to 
which special names are given; for example, verbal meaning, arithmetical 
reasoning, comprehension, visual perception of relationships, and others. 
Consequently, in constructing a mental test, it appeared to Thorndike 
himself that his "atomistic" theory and the multitude of minute elements 
of ability are of less practical significance than the conception that many 
of them operate together in any situation demanding intelligence. This 
is illustrated by Thorndike's test designed to measure ability to deal with 
abstractions. His test is composed of four parts: sentence completion (C), 
arithmetical reasoning (A), vocabulary (V), and following directions (D). 
This instrument is known as the CAVD test. It is not claimed by Thorn- 
dike that these four sets of items encompass the entire range of abstract 
intelligence. They represent and sample only certain parts; but because 
of the significant correlations between all types of measures within the 
tested range, it is held, the other aspects of abstract intelligence can be 
estimated with satisfactory accuracy from those portions that are actually 
measured by this test. 

The Two-Factor Theory. Opposed to Thorndike's theory of 
the nature of intelligence is Spearman's two-factor theory, which stands 
at the other extreme of interpretations. According to Spearman, all intel- 
lectual activity is dependent primarily upon, and is an expression of, a 
general factor common to all mental activity. This factor, designated by 
the symbol g, is possessed by all individuals, but in varying degrees, of 
course, since people differ in mental ability; and it (g) operates in all 
mental activity, though in varying amounts, since mental tasks differ in 
respect to their demands upon general intelligence. Spearman characterized 
this general factor as mental energy, because in the realm of intelligent 
activity, he maintained, it has a role similar to that of physical 
energy in the physical world. Like all other scientific concepts, the general 
factor can be observed and known only through its specific manifestations 
—in this instance, through psychological tests. After analyzing tests with 
varying amounts of the general factor, from high to low, Spearman con- 
cluded that the principal distinguishing characteristic of tests highly 
“loaded” with g is that they require insight into relationships—what he 
called “the eduction of relations and correlates." For example, in solving 
an arithmetical problem, the subject has to grasp the relationships between 
the data presented, organize them with reference to the propositions 
given in the problem, and deduce a correct answer. The g-content 
in this task is high. By contrast, if the subject merely has to repeat a 
table of multiplications or add a few numbers (both of which can be 
learned by rote), no insights are necessary and no relationships need 
be grasped. In this task, the amount of g involved is very small.¹⁰ 
Spearman postulated the g factor, in the first place, to explain correla- 
tions that he found to exist among diverse sorts of perceiving, knowing, 
reasoning, and thinking, as illustrated in Table 7.1. That is to say, he 


TABLE 7.1 

INTERCORRELATIONS OF SUBTESTS 

                                   1     2     3     4     5     6     7 
(1) Analogies                           .50   .49   .55   .49   .45   .45 
(2) Completion                    .50         .54   .47   .50   .38   .34 
(3) Understanding paragraphs      .49   .54         .49   .39   .44   .35 
(4) Opposites                     .55   .47   .49         .41   .32   .35 
(5) Instructions                  .49   .50   .39   .41         .32   .40 
(6) Resemblances                  .45   .38   .44   .32   .32         .35 
(7) Inferences                    .45   .34   .35   .35   .40   .35 



Source: C. Spearman (6, p. 149). By permission. Spearman's method of statistical analy- 
sis is presented later in this chapter. 


concluded that all mental activity is to some extent dependent upon, 
and an expression of, this general factor; and the magnitude of the 
correlation coefficient found between any two forms of mental activity 
reveals the extent to which this g factor is operative in each, and com- 
mon to both. Thus, the amount of the general factor operating in each 
activity will determine the size of the correlation between the two mental 
activities being measured. The types of materials used in current tests 
of intelligence—word meaning, arithmetical reasoning, sentence com- 
pletion, reasoning by analogy, paragraph interpretation, perception of 
relationships in geometric forms, picture completion, and others—all 
show significant degrees of positive correlation with one another. Spear- 
man and his supporters at first ascribed this fact to the presence of g, 
in greater or lesser amount, in all of them. Later researches led them to 
conclude that certain "group factors" are also present in some mental 
activities. These are the factors that occur in more than one type of test 
item, but in less than all of any given set of tests. The general factor, 
however, still remains the primary and pervasive one. 

10 When using an individual test, like the Stanford-Binet or the Bellevue, it has often 
[...] that a subject who is unable to solve an arithmetical [...] even to make a start [...] 



Since the intercorrelations are by no means perfect, Spearman postu- 
lated the existence of specific factors, called s factors, each of which is 
specific to a particular type of activity. Thus, the two-factor theory states 
that all mental activities have in common some of the general factor; each 
mental activity might also be a member of a "group"; and each has also 
its own specific factor. Of the kinds of factors, the general one is regarded 
as the essential measure of intelligence; accordingly a sound test of intel- 
ligence is one that will sample adequately the g factor in a variety of 
activities, and the best test materials are those that call for the largest 
amount of the general factor. And the largest amounts of the general 
factor are believed to be demanded by those types of test materials that 
have the higher intercorrelations with one another. 
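
The relation between g-content and the size of a correlation can be stated compactly. The sketch below uses later textbook notation rather than Spearman's own wording: the symbols z_i (standardized score on test i), a_i (the g loading of test i), and s_i (the specific factor of test i) are introduced here as assumptions for illustration, and group factors are left out.

```latex
% Standardized score on test i as a weighted sum of the general factor g
% and the test's own specific factor s_i (all factors standardized).
\[
  z_i = a_i\, g + \sqrt{1 - a_i^{2}}\; s_i
\]
% If g and every s_i are mutually uncorrelated, only g is shared by two
% different tests, so their correlation is the product of their loadings.
\[
  r_{ij} = a_i\, a_j \qquad (i \neq j)
\]
% Worked example: loadings of .80 and .60 imply a correlation of .48.
\[
  r_{ij} = (.80)(.60) = .48
\]
```

Under these assumptions, two tests correlate highly only when both are heavily loaded with the general factor, which is the sense in which the higher intercorrelations are taken to mark the best test materials.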

As a matter of fact, since the beginning of modern mental testing, 
psychologists have proceeded, at least implicitly, on the assumption that 
all forms of mental activity have something in common—that they are 
similar in certain basic respects. Otherwise, psychologists could not have 
justified their practice of testing together in a single instrument such 
diverse mental activities as defining words, solving arithmetical problems, 
finding similarities and differences, repeating digits forward and backward, 
completing sentences in a meaningful manner, and perceiving geometric 
forms. All of these, and the others used, must have been regarded 
as being measures, to a greater or lesser degree, of general intelligence. 
From the total performance on these tests, it was believed that an individual's 
level of general intelligence would emerge. Therefore, psychologists 
believed they were justified in adding up the test items correctly 
passed in the several types of activity and deriving a single total score to 
represent an individual's general intelligence level. This is the actual 
practice in nearly all tests, including individual as well as group scales of 
mental ability.¹¹ 

The practical implication of the Spearman two-factor theory is clear, 
so far as test construction is concerned. A test conforming to this theory 
would be one whose materials and several parts are saturated with the 
general factor, so that measurement thereby [...] 
and quality of g to emerge, while the effects of specific factors would 
be canceled out. Thus, the net result of the test would be a measure of g. 
To achieve this would require a skillful selection and [...] of 
test problems and parts that are significantly intercorrelated, which at 
the same time satisfy the practical criteria of intelligent activity. Such a 
test, presumably, would yield an index that reflects the caliber of a par- 
ticular mentality working as a whole. 

11 While this practice is not being discontinued, and should not be, increased emphasis 
is now being placed on the desirability of representing each individual by means of a 
test profile, where possible, as well as by a general index. There are some psychologists, 
however, who would abandon the use of all indexes of general level and would substitute 
a profile representing the individual's relative rank in each of the specific types 
of test materials being used, such as numerical ability, word meaning, spatial percep- 
tion, and the like. 
12 In addition to g and s, Spearman and others have found by further analysis of experimental 
results that there are some nonintellectual factors (such as volition, interest, 
persistence) that influence a person's effectiveness. Spearman and adherents of his 
theory have also discerned a few groups of factors that are intermediate between g and 
the highly specific s. They suggest that musical aptitude and mechanical aptitude are 
of this type. 

The Group-Factor Theory. Intermediate between the theories 
of Thorndike and Spearman are the group-factor theories; prominent 
among them is that of Thurstone. His work has received the most atten- 
tion and has resulted in the construction of a set of measures called tests 
of primary mental abilities. 

According to the group-factor theory, intelligent activity is not an ex- 
pression of innumerable highly specific factors, as Thorndike claimed. 
Nor is it the expression primarily of a general factor that pervades all 
mental activity and is the essence of intelligence, as Spearman held. In- 
stead, the analyses and interpretations of Thurstone and others led them 
to the conclusion that certain mental operations have in common a “primary” 
factor that gives them psychological and functional unity and that 
differentiates them from other mental operations. These mental opera- 
tions, then, constitute a “group.” A second group of mental operations has 
its own unifying primary factor; a third group has its own; and so on. In 
other words, there are a number of groups of mental abilities (the number 
being as yet undetermined), each of which has its own primary factor, 
giving the group a functional unity and cohesiveness. Each of these pri- 
mary factors is said to be relatively independent of the others. 

After administering a large variety of test materials to college students 
and to high school and eighth-grade pupils, and after making correla- 
tional and factorial analyses of the results, Thurstone and his col- 
laborators concluded that six primary factors emerged clearly enough for 
identification and use in test design and construction. They are, briefly, 
the following (17):¹³ 


The Number factor (N): “ability to do numerical calculations rapidly and 
accurately.” 

The Verbal factor (V): “found in tests involving verbal comprehension.” 

The Space factor (S): “involved in any tasks in which the subject manipu- 
lates an object imaginally in space.” 

The Word Fluency factor (W): "involved whenever the subject is asked 
to think of isolated words at a rapid rate." 

The Reasoning factor (R): "found in tasks that require the subject to 
discover a rule or principle involved in series or groups of letters." Although 
it is believed both induction and deduction are involved, it seems that in- 
duction is the more significant here. 

The Rote Memory factor (M): involving "the ability to memorize quickly." 

13 Some modifications of factors have been introduced in recent issues of these tests for 
younger subjects. The six primary mental abilities do not include, for 
example, mechanical, musical, or artistic aptitudes. [...] required in abstract 
intelligence and in academic learning. 


Although primary mental abilities (or factors) were originally said to be 
functionally independent of each other, it was actually found that they 
are positively and significantly intercorrelated, as shown in Table 7.2. 


TABLE 7.2 

INTERCORRELATIONS OF SUBTESTS 

        N       W       V       S       M       R 
N 
W      .4[?] 
V      .40     .54 
S      .38     .17     .16 
M      .31     .36     .35     .18 
R      .58     .49     .59     .20     .39 

N, number facility; W, word fluency; V, verbal meaning; S, spatial perception; 
M, rote memory; R, reasoning. 
Source: L. L. and T. G. Thurstone (17). By permission. 


This must mean that the primary and presumably independent factors 
are not the only factors at work in the mental activities required by the 
tests. There must be some other factor, or factors, to account for the 
common ground (as shown by the positive correlations) that exists be- 
tween the various psychological tests intended to measure these primary 
factors. In other words, it seems that the test authors have not been able 
to devise test materials that will sample the primary mental abilities in 
pure form. The Thurstones, therefore, concluded that in addition to the 
primary abilities, there is a “second-order general factor." They also 
stated, in their earlier test manual, that “If further studies of the primary 
mental abilities should reveal this general factor, it may sustain Spear- 
man’s intellective factor" (17, p. 7). 

Subsequent studies of the primary mental abilities do tend to reveal a 
general factor. The more recent intercorrelations found among the several 
tests that make up the PMA¹⁴ batteries are quite marked, especially at the 
lower age levels (when abilities are less differentiated through education 
and interest than they will be in later years). 

14 Primary Mental Abilities. 

For the tests at the 5- to 7-year level, the intercorrelations range from 
.46 to .67 (average equals .55). For tests at the 7- to 11-year level, the range 
of coefficients is from .41 to .70 (average equals .50); while for ages 11 to 
17, the range of coefficients is from .13 to .59 (average equals .30+). It appears, 
therefore, that the group-factor adherents have found it necessary to 
posit the operations of a general factor; but at present they regard it as 
being of a "second order." 

In evaluating the group-factor hypothesis we need not question the 
soundness of the statistical methods used or the comprehensiveness of the 
experimentation.¹⁵ Several observations are necessary, however, to enable 
the reader to make a fuller assessment of the hypotheses. In the first place, 
intelligence is not an entity that operates in a vacuum; it is not something 
"given," even in the sense that some physical traits are "given," such as 
the color of eyes and hair, the number of digits. Intelligence is, rather, a 
name for certain kinds of activity; we can know of it only through its 
manifestations in behavior. Intelligent behavior develops and is mani- 
fested in one kind of environment or another; hence, the particular form 
of expression that intelligent activity takes will depend upon the sort of 
functions developed and fostered in a given cultural environment. In 
our own and similar cultures, verbal and numerical abilities are essential; 
they are fostered from earliest childhood, and they receive greatest atten- 
tion and emphasis in our schools. Consequently, there is a relationship be- 
tween this cultural emphasis and the fact that three of Thurstone's six 
primary factors are concerned with words and numbers. It is probable also 
that the "Space" factor emerges from statistical analyses because of our 
experiences with things in three-dimensional space. The two remaining 
factors, "Reasoning" and "Rote Memory," are characteristic, in greater or 
lesser degree, of all persons regardless of the particular culture; we should [...] 

15 Some psychologists have criticized adversely Thurstone's methods and his interpretations [...] 
16 Apprehension of one's own experience, the eduction of relations, and the eduction 
of correlates. 

The proponents of the group-factor hypothesis do not claim that the 
exact number of primary mental abilities is known. Hence, we must caution 
the reader against assuming that there is a finality about the present 
number. For example, a factor of "Speed" will not appear in a statistical 
analysis of test results unless speed of performance is, first, a variant in 
the population being measured and, second, a requirement within the 
tests themselves. The same can be said of “Persistence” or "Mental Fa- 
tigue.” Similarly, “Originality” would appear as another factor, if it could 
be measured. 

These observations do not invalidate the tests that have been or might 
be designed on the basis of the group-factor hypothesis. As a matter of 
fact, the contents of such tests thus far published, though differently organized, 
are not radically different in their essentials from those designed 
on the basis of either of the two other theories of the nature of intelli- 
gence. 

For our present purposes, two consequences of group-factor analyses 
are indicated. First, the conceptual framework has resulted in more clearly 
specified and defined test categories and types of test items than was the 
case previously. Second, several batteries of tests have been constructed on 
the basis of group-factor theory. 

The early versions of the PMA tests did not yield an over-all index of 
performance, such as mental age, intelligence quotient, or an over-all 
percentile rank. Instead, they gave, for each subject, separate percentile 
ranks to represent his performance level in each of the primary factors. 
These ranks were then used to make a “profile” for each person for edu- 
cational and vocational guidance. While they granted that the single index 
(mental age and IQ), based upon a variety of mental activities, is useful 
for many practical purposes, group-factorists originally maintained 
that their method of finding separate ratings for each of the primary 
factors enables the examiner more readily and adequately to recognize a 
testee’s marked mental abilities and disabilities, the degree of uniformity 
or lack of uniformity. 

There is merit in this contention; yet, at the same time, there is no 
good reason why the group-factor type of test should not also yield an 
over-all rating (such as MA or IQ) as well as indexes of relative rank for 
each of the specified factors. While mental age and intelligence quotient 
should never be interpreted and used mechanically and uncritically, or in 
disregard of the specific performances that have contributed to them, they 
do nevertheless have considerable significance and valuable connotations 
for the qualified examiner and interpreter. 

17 [...] group test or the [...] 
studied. In the case of group tests, the individual's performance on each of the several 
parts can be compared with his performances on the other parts. The disadvantage here 
is that the usual group test does not provide separate scores and norms for each of the [...] 
18 A weakness of the group-factor type of test is that the breakdown into separate factors 
ignores the fact that intelligence expresses itself in behavior as a combination, a unity 
of functions, not as a series of independent factors. 

The group factorists have apparently recognized this point in their 
more recent interpretations of test findings. Equally important to them 
is the fact that their correlations and factorial analyses have persistently 
yielded results that could not be explained in terms of group factors 
alone, and that the g factor was indicated. As a result, the most recent edi- 
tions of the PMA tests provide IQ equivalents for the scores on the scale 
for ages 11 to 17, but for the younger age levels both MA units and quo- 
tients are provided. 

As is so often the case in scientific problems—especially in the relatively 
new ones—divergent theories in time tend to come into closer agreement. 
The Spearman Two-Factor Theory now recognizes that some group factors 
should be posited to explain test findings; but emphasis is upon the 
g factor. Perhaps the Spearman theory may now be renamed “The Gen- 
eral Factor-Group Factor Theory,” and the other might be renamed “The 
Group Factor-General Factor Theory.” The narrowing of differences be- 
tween the two theories represents significant scientific progress. 


Factor Analysis 


The two-factor and the group-factor theories are the two most 
prominent examples of doctrines emerging from the methods of factor 
analysis. Although this subject is highly technical, it is desirable to ex- 
plain it more fully at this stage. 

The technique is essentially a search for the psychological functions 
that are at the basis of and determine test performance. All techniques 
of factor analysis are statistical and based upon the correlation coefficient. 
After the statistical calculations have been made, it is necessary for the 
investigator to bring to bear his [...] name his statistical findings [...] 
included and so grouped as to measure only, or [...] factors he has 
segregated from his preliminary testing and statistical analysis. 

The factor analyst does not begin with a definite set of preconceived 
mental functions. He tries to discover which psychological functions, or 
components, are necessary to explain his data. Yet, it should be noted, he 
must at the outset have some conception of the kinds of test items to in- 
clude in preliminary experimentation. Thus, what he ultimately distills 
out as factors are basically dependent upon his original conceptions regarding 
his preliminary items. The factor analyst, in seeking the components 
of intelligence, for example, does not start with tests of color perception, 
tone discrimination, or finger dexterity. 

Two-Factor Theory. We have already stated Spearman's two- 
factor theory. It will be helpful now to describe in more detail the rea- 
soning that led to the theory. Spearman, in his early experimentation, was 
impressed by the fact that all the intercorrelations were positive in a table 
of coefficients for various types of items. He was also impressed by what 
appeared to be a hierarchy of coefficients in the rows and columns of the 
table; not perfect gradations, but strong evidence of proportional grada- 
tions. He therefore offered a hypothetically perfect table of correlation 
coefficients to illustrate his point (Table 7.3). Not only are the coefficients 
positive and in a decreasing order along rows and columns, but theoreti- 
cally any two columns (or rows, since the table is symmetrical about the 
diagonal which contains the self-correlations) are in direct proportion. 


TABLE 7.3 

SPEARMAN'S HYPOTHETICAL TABLE OF CORRELATIONS 

                         1     2     3     4     5 
(1) Opposites                 .80   .60   .30   .30 
(2) Completion          .80         .48   .24   .24 
(3) Memory              .60   .48         .18   .18 
(4) Discrimination      .30   .24   .18         .09 
(5) Cancellation        .30   .24   .18   .09 

Source: C. Spearman (6, p. 74). By permission. 


The criterion of proportionality requires that the following correlational 
relationships should hold: 


$$\frac{r_{13}}{r_{23}} = \frac{r_{14}}{r_{24}} = \frac{r_{15}}{r_{25}}.$$

Taking only the first two ratios, 

$$\frac{r_{13}}{r_{23}} = \frac{r_{14}}{r_{24}},$$

and multiplying by the denominators, we have 

$$r_{13}\, r_{24} = r_{23}\, r_{14}.$$

Transposing, the result is 

$$r_{13}\, r_{24} - r_{23}\, r_{14} = 0.$$


From this tetrad equation (so called because the test correlations are dealt 
with in sets of four) may be obtained what is known as the tetrad dif- 
ference. 

The tetrad equation may be written for the combination of any four 
tests. By rearranging the four coefficients, three tetrad differences may be 
obtained for every combination of four tests. Thus, using t as the nota- 
tion for tetrad difference: 

$$t_{1234} = r_{12}\, r_{34} - r_{13}\, r_{24}$$

$$t_{1243} = r_{12}\, r_{34} - r_{14}\, r_{23}$$

$$t_{1342} = r_{13}\, r_{24} - r_{14}\, r_{23}$$
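
The proportionality criterion and the three tetrad differences can be checked numerically. The following is a minimal sketch that uses the coefficients of Spearman's hypothetical Table 7.3; the dictionary layout and the names `r` and `tetrad_differences` are assumptions made for illustration and are not part of the original method.

```python
from itertools import combinations

# Spearman's hypothetical correlations (Table 7.3), keyed by test-number pair.
R = {
    (1, 2): 0.80, (1, 3): 0.60, (1, 4): 0.30, (1, 5): 0.30,
    (2, 3): 0.48, (2, 4): 0.24, (2, 5): 0.24,
    (3, 4): 0.18, (3, 5): 0.18,
    (4, 5): 0.09,
}


def r(i, j):
    """Correlation between tests i and j (the table is symmetrical)."""
    return R[(i, j)] if (i, j) in R else R[(j, i)]


def tetrad_differences(a, b, c, d):
    """Return the three tetrad differences for the four tests a, b, c, d."""
    t_abcd = r(a, b) * r(c, d) - r(a, c) * r(b, d)
    t_abdc = r(a, b) * r(c, d) - r(a, d) * r(b, c)
    t_acdb = r(a, c) * r(b, d) - r(a, d) * r(b, c)
    return t_abcd, t_abdc, t_acdb


# Proportionality of columns 1 and 2: r13/r23, r14/r24, r15/r25.
ratios = [r(1, k) / r(2, k) for k in (3, 4, 5)]
print("column ratios:", [round(x, 4) for x in ratios])

# Largest absolute tetrad difference over every combination of four tests.
largest = max(
    abs(t)
    for four in combinations(range(1, 6), 4)
    for t in tetrad_differences(*four)
)
print("largest |t|:", largest)
```

Because the table was constructed to be perfectly hierarchical, every ratio equals 1.25 and every tetrad difference vanishes apart from floating-point rounding.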
Theoretically, 


the tetrad-difference criterion is satisfied if t is zero. 
When the tetrad difference is zero, Spearman and others have offered
mathematical evidence to demonstrate that a single common factor can
account for the relationships among the four tests, or variables. But in
fact, the difference is rarely, if ever, zero. However, if the differences
are close to zero, we may also conclude the criterion is satisfied, since
correlations between tests will be affected by chance errors of
measurement and accidental factors (see discussion of reliability in
Chapter 4). Furthermore, the correlation coefficients would also be
affected by the operations of the specific factor in each test. The
specific factor was postulated to explain, in part at least, the tetrad
differences that were greater than zero.

If more than four tests are being analyzed to disclose the functions
involved, we may substitute tests [...] tetrad equations, in place of [...]
analyzing tests 1, 2, 5, [...] satisfied, it would be concluded that the
functions common to 1 and 2 are identical with those [...]
dependent upon the extent to which g is involved in each. Subsequent
investigations showed, however, that some test intercorrelations may include
their own factors, common only to them, beyond the single common g.
It was necessary, therefore, to postulate the operations of group factors,
each group being effective in two or more tests, but not in all of them.
Spearman and others recognized such group factors as numerical, verbal,
speed, mechanical, imagination, and attention. In addition, Spearman
had postulated three nonintellective factors that influence one's mental
effectiveness. These are perseveration (p); oscillation (o), being one's
variability in performance in continuous mental activity; and will (w),
being one's persistence in effort.

The tetrad-difference criterion does not in itself show the relative
weight or importance of the common factor in each kind of test. Following
the work of Spearman, therefore, methods have been developed for finding
the weights (commonly called "loadings") of each factor, general or
group, in each of the intercorrelated tests. These methods are known as
"factor pattern analysis" (4, 18, 18).
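
To give a concrete sense of what a loading is, the sketch below estimates
the weight of a single common factor in each test of Spearman's
hypothetical table. It assumes the simplest possible case, one common
factor only, in which each correlation is the product of the two loadings,
so that the squared loading of test i equals r_ij * r_ik / r_jk. This
triad estimate is an illustrative shortcut and is not the factor pattern
analysis cited above; the names used are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Off-diagonal coefficients of Spearman's hypothetical table (Table 7.3).
R = {
    (1, 2): 0.80, (1, 3): 0.60, (1, 4): 0.30, (1, 5): 0.30,
    (2, 3): 0.48, (2, 4): 0.24, (2, 5): 0.24,
    (3, 4): 0.18, (3, 5): 0.18,
    (4, 5): 0.09,
}


def r(i, j):
    """Correlation between tests i and j (the table is symmetrical)."""
    return R[(min(i, j), max(i, j))]


def g_loading(i, tests=(1, 2, 3, 4, 5)):
    """Average the triad estimates sqrt(r_ij * r_ik / r_jk) for test i."""
    others = [t for t in tests if t != i]
    estimates = [
        sqrt(r(i, j) * r(i, k) / r(j, k))
        for j, k in combinations(others, 2)
    ]
    return sum(estimates) / len(estimates)


loadings = {i: round(g_loading(i), 3) for i in range(1, 6)}
print(loadings)
```

In a perfectly proportional table every triad gives the same estimate for
a given test. With real data the triads disagree, which is one reason more
elaborate methods were developed.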

The two-factor theory can account for the universal positive correlation
coefficients among the various kinds of test items included in scales to
measure mental ability, since every form of test requires the operation of
g to some degree. Pooling a variety of kinds of tests in a scale is sound
practice, according to this theory, because we thereby approximate a
measure of pure g. Since the s factors are uncorrelated within any
individual (that is, they may be possessed in varying and random degrees
by a person), they will be a negligible factor in the total performance in
a test of general ability, because the varied s factors will tend to
cancel out one another.
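
The cancelling of specific factors can be illustrated with a small
calculation. The sketch below assumes a deliberately simplified form of
the two-factor model: every test is in standard-score form, every test has
the same g loading, and the specific factors are uncorrelated with g and
with one another. Under those assumptions the correlation between the sum
of n tests and g has a closed form. The function name and the loading of
.60 are hypothetical.

```python
from math import sqrt


def composite_g_correlation(n_tests, g_loading):
    """Correlation between the sum of n_tests tests and g.

    Assumes standardized tests, an equal g loading in every test, and
    mutually uncorrelated specific factors.
    """
    common = n_tests * g_loading                  # covariance of the sum with g
    specific = n_tests * (1.0 - g_loading ** 2)   # pooled specific variance
    return common / sqrt(common ** 2 + specific)


for n in (1, 2, 4, 8, 16):
    print(n, round(composite_g_correlation(n, 0.6), 3))
```

The variance contributed by g grows with the square of the number of
tests, while the pooled specific variance grows only in proportion to that
number, so the composite approaches a measure of g as more kinds of tests
are pooled.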
Sampling Theory. The two-factor theory has been criticized by
some statistical psychologists, notably G. H. Thomson and L. L.
Thurstone. Thomson offers a sampling theory to explain the same tables of
intercorrelations (11). Briefly, his view is that the coefficients of
correlation are the results of common samplings and combinations of
independent factors. The number of common independent factors utilized by
two tests will determine the coefficient of correlation between these two.
This theory is, of course, the same as Thorndike's, except that Thomson
concedes the practical usefulness of a concept like g. Thomson also adds
that if several tests call upon many elementary factors in common, they
will not only have a very marked or high coefficient of correlation, but
they will give the appearance of having one common comprehensive factor.
Also, Thomson's theory maintains that if several tests draw upon a
relatively smaller number of the elementary factors in common, these are
group factors; that is, a limited number of factors that enter into
performance on types of tests which are distinguished by the fact that
they have certain mental processes in common but do not share a very large
number of elementary factors or a universal g.
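
The sampling view lends itself to a simple numerical sketch. Suppose, as
an assumption made here for illustration only, that each test score is the
unweighted sum of the independent elementary factors it samples and that
all elements have equal variance. The correlation between two tests is
then the number of elements they share divided by the geometric mean of
the numbers each samples. The element numbering below is hypothetical.

```python
from math import sqrt


def overlap_correlation(elements_a, elements_b):
    """Correlation implied by shared elements under equal, independent elements."""
    a, b = set(elements_a), set(elements_b)
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical tests, each drawing 20 elementary factors from a larger pool.
test_a = range(0, 20)
test_b = range(10, 30)   # shares 10 elements with test_a
test_c = range(18, 38)   # shares only 2 elements with test_a

print(overlap_correlation(test_a, test_b))
print(overlap_correlation(test_a, test_c))
```

Tests that share many elements correlate highly and so give the
appearance of one comprehensive factor, while tests that share only a few
produce the low coefficients associated with narrower group factors.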
Although the two theories require that a scale measuring general
ability should pool a variety of tests that differ in content and in the
processes employed, the two-factor theory requires subtests (parts of a
scale) that have high intercorrelations. The sampling theory, on the other
hand, requires subtests having low intercorrelations among themselves
but high correlations with the criteria of validity.
Group-Factor Theory. As already stated, Thurstone and others
believe that a group-factor theory fits the facts best and is most useful
in testing practice. Their view differs from Thomson's in that they reject
the theory of a very large number of independent factors. As previously
explained, a group factor is conceived of as an operational concept to
account for correlations of performance within only a limited group of
tests.19 Several different groups of factors are necessary to account for
all mental activity, plus the more recently added second-order g factor,
which Thurstone states may be more "central" in character and more
"universal" in influence.

Thurstone has contributed significantly to the methodology of
factor analysis; and from his analyses, as we shall see, has emerged a scale
to test mental ability. This volume is not the place to present his
techniques; we shall merely state his purposes.20

Three objectives, according to Thurstone, are to be achieved by factor
analysis: (1) determination of the smallest number of primary mental
abilities to be postulated as an explanation of tables of intercorrelations;
(2) determination of the amount of each primary ability that is involved
in each test; and (3) determination of regression equations whereby the
amount of a primary mental ability in an individual can be estimated
from tests that draw upon that ability. As an illustration, we may
consider several tests in which only two group factors are involved. An
individual might make a high score in these tests either by having a
moderately high level of ability in each of Factor I and Factor II, or by
having very much of one and little of the other. Also, if Factor I carries
much heavier weight in the tests than does II, then a high level of ability
on I is more important for high performance on these tests than is a high
level on II. Thus, the Thurstone method would find the relatively few
primary or basic mental abilities, devise a scale to measure all of them,
and so


19 Most recently, some group-factor theorists have characterized primary
factors as facilities of the mind and as media of expression.
20 A number of others have made significant contributions to factor
theory, especially K. J. Holzinger, H. Hotelling, R. C. Tryon, H. E.
Garrett, C. L. Burt, J. C. Flanagan, J. P. Guilford, and P. E. Vernon.
TABLE 7.4

TABULAR REPRESENTATION OF THEORIES OF MENTAL ABILITY


THE TWO-FACTOR PATTERN

Test    General factor    Specific factors
 1            x                 S1
 2            x                 S2
 3            x                 S3
 4            x                 S4
 5            x                 S5
 6            x                 S6


THE GROUP-FACTOR PATTERN

Test    Group factors
 1      x  x  x
 2      x  x  x
 3      x  x
 4      x  x  x
 5      x  x  x
 6      x  x


FACTOR THEORIES COMBINED

Test    General factor    Group factors    Specific factors
 1            x                 x                S1
 2            x                 x                S2
 3            x                 x                S3
 4            x                 x                S4
 5            x                 x                S5
 6            x                 x                S6


organize and score the subtests as to reveal each individual's relative
strength in each factor.
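
The two-factor illustration (a high score reached either through
moderately high levels of both Factor I and Factor II, or through much of
one and little of the other) can be sketched as a weighted sum. The
weights and factor levels below are hypothetical numbers chosen only to
show the arithmetic; they are not taken from Thurstone's analyses.

```python
def composite_score(factor_levels, weights):
    """Weighted sum of an individual's factor levels (a linear sketch)."""
    return sum(w * f for w, f in zip(weights, factor_levels))


balanced = (1.0, 1.0)    # moderately high on Factor I and on Factor II
mostly_i = (2.0, 0.0)    # very much of Factor I, little of Factor II
mostly_ii = (0.0, 2.0)   # little of Factor I, very much of Factor II

equal_weights = (0.5, 0.5)
factor_i_heavy = (0.8, 0.2)   # Factor I carries much heavier weight

for label, person in (("balanced", balanced),
                      ("mostly I", mostly_i),
                      ("mostly II", mostly_ii)):
    print(label,
          composite_score(person, equal_weights),
          composite_score(person, factor_i_heavy))
```

With equal weights the three individuals earn the same total score. Once
Factor I is weighted more heavily, the individual strong in Factor I pulls
ahead, which is why separate factor scores tell more than the total.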

Summary. Methods of factor analysis differ somewhat in their
assumptions, and analysts differ somewhat in their interpretations of
results, but the general conclusions derived by the several methods of
analysis and interpretation do not differ radically. All factorial theories
now postulate the presence of group factors, although the groups are not
always identical and differ in relative emphasis placed upon them by
different theories. Most theories also find a general factor necessary to
explain intercorrelations, although here again emphasis upon the general
factor varies. All agree that an individual's mental activities are
attributable to the various ways in which the general and group factors
operate in the performance of varied mental tasks. Although several types
of factorial analysis are possible, basically the choice between them and
the interpretations derived through them must rest upon psychological
insights and concepts rather than upon statistical methods. Factors should
not be regarded as fixed, predetermined mental entities. The factors that
are found are influenced by the ages of the persons tested, by interests,
by experience and training, and by the test items originally employed in
the preliminary investigations. Factorial analysis is a statistical method
that provides the means of improving test construction and of classifying
test performance.


Illustrations of Factors 


The following illustrations will assist the student to grasp more
fully the problem of factors in psychological testing. As already stated,
factor analysis techniques depend basically upon test intercorrelations.
Tables 7.5 and 7.6 show the intercorrelations among two sets of four
subtests of the Wechsler scale for children.

Since all the coefficients in Table 7.5 are positive and quite marked, we
conclude that all four tests have much in common, but that Vocabulary,
Information, and Similarities have somewhat more in common with one
another than they do with Comprehension. But since all the coefficients
are far from perfect (+1.00), we should use all four to sample the testee's
mental abilities, rather than only one or two to the exclusion of the
others. The fact that these four tests are not perfectly correlated, or
nearly so, might be due to one of these possibilities: (1) that each test
samples the g factor in different amounts, plus its own specific factor; (2)


TABLE 7.5

INTERCORRELATIONS OF FOUR SUBTESTS OF THE WECHSLER
INTELLIGENCE SCALE FOR CHILDREN

                  Vocab.    Info.    Sims.    Comp.
Vocabulary                   .44      .66      .60
Information                           .67      .61
Similarities                                   .61
Comprehension

Source: Manual, The Psychological Corporation.
that the tests have g in common but each test samples also one or more
group factors, though not necessarily the same ones; or (3) that each has
many highly specific factors in common with every other one, as well as
unique factors. A technical factorial analysis would go beyond this
inspection analysis in an effort to determine which of these three
hypotheses is the most plausible one, and to determine to what extent
performance on each of the four tests calls upon whatever factors (g or
others) might be inferred from the statistical analysis.

Table 7.6, by contrast, shows four subtests that have low intercorrela- 
tions. Of the six coefficients, only two (.40 and .46) are large enough to 
suggest that the subtests involved have much in common in psychological 
functions. The coefficient of .46 between Comprehension and Arithmetic 


TABLE 7.6

INTERCORRELATIONS OF FOUR SUBTESTS OF THE WECHSLER
INTELLIGENCE SCALE FOR CHILDREN (continued)

                   Obj. Assemb.    Comp.    Arith.    Digit Sp.
Object Assembly                     .13      .20        .13
Comprehension                                .46        .28
Arithmetic                                              .40
Digit Span

Source: Manual, The Psychological Corporation.


is attributable to the demands that both of these tests make upon
reasoning ability, or, more specifically, ability to analyze a set of given
material and then reorganize the elements toward the solution of the
specified problem. The coefficient of .40 between Arithmetic and Digit Span
is attributable, it appears from the characteristics of the two tests, to
facility with numbers and ability in immediate recall (as contrasted with
delayed recall). The remaining four coefficients are so low as to suggest
that the tests concerned have little dependence upon common functions
(whether g or other factors), and that each makes demands upon some factor
or factors not called upon by the others. Here again, an analysis would
attempt to identify more precisely the factors involved; but in so doing,
the analyst would have to apply his knowledge of psychological functioning
to the items that cluster together, as shown by the analysis.
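
The inspection of Table 7.6 can be repeated in a few lines of Python. The
coefficients are those of the table; the cutoff of .40 is an assumption
adopted here only because the text singles out .40 and .46 as the two
coefficients large enough to suggest common functions. Squaring each
coefficient gives the proportion of variance the two subtests share, a
standard reading of a correlation coefficient.

```python
# Upper triangle of Table 7.6 (WISC subtest intercorrelations).
TABLE_7_6 = {
    ("Object Assembly", "Comprehension"): 0.13,
    ("Object Assembly", "Arithmetic"): 0.20,
    ("Object Assembly", "Digit Span"): 0.13,
    ("Comprehension", "Arithmetic"): 0.46,
    ("Comprehension", "Digit Span"): 0.28,
    ("Arithmetic", "Digit Span"): 0.40,
}

THRESHOLD = 0.40  # illustrative cutoff, not a rule stated by the manual

# Pairs whose coefficient reaches the cutoff.
marked = {pair: r for pair, r in TABLE_7_6.items() if r >= THRESHOLD}

# Proportion of variance shared by each pair of subtests (r squared).
shared_variance = {pair: round(r * r, 4) for pair, r in TABLE_7_6.items()}

for pair, r in TABLE_7_6.items():
    flag = "marked" if pair in marked else "low"
    print(pair, r, shared_variance[pair], flag)
```

Even the largest coefficient, .46, corresponds to roughly a fifth of the
variance held in common, which shows how much of each subtest remains to be
explained by other factors.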
Figure 7.1 shows in graphic form how two and three tests might be
intercorrelated. As the number of types of tests increases, the possible
factor interrelationships may become more numerous and complex, though it
is extremely improbable that no overlapping whatever of factors will be
found in measuring human abilities by means of two or more different
types of tests. In view of the more recent partial reconciliation of the
group-factor and the general-factor theories, the illustrated overlappings
are most probably attributable to g.


FIG. 7.1. Possible intercorrelations between two and three tests


The possible factor interrelationships of the parts of Figure 7.1 are as
follows:


A. Each test is factorially independent of the other. The factor or factors
in each are unique to it, either as group factors or as specific factors.

B. The overlapping shaded area indicates a factor or factors common to
both tests. This may be g or a group factor. The unshaded area indicates
factors unique to each, either specific or group, or both. When many diverse
tests show some overlapping among all of them, the soundest inference is
that a g factor accounts for the common ground.

C. In this instance the tests may be measuring only the general factor,
or g plus the same group factor, or just the same group factor. There are no
unique factors. It is extremely improbable, in this situation, that the
general factor is not involved. If numerous pairings of different tests
showed this relationship, the soundest inference would be that a general
factor is being measured.

D. Each of the three tests is factorially independent of the other two.
The uniqueness of each may be due to group or specific factors, or to both.

E. The overlapping of 1 and 2 here may be attributable to g or to a group
factor. Test 3 is independent of the others. The nonoverlapping segments
of 1 and 2 may represent separate group factors or specific factors in each
test.

F. In this figure, overlapping between 1 and 2, and between 2 and 3, is
attributed to one or more group factors, but different ones in each case,
since there is no common ground between all three tests. The unshaded areas
would represent special factors or group factors, or both, that are not
shared with either of the other two tests.

G. Here there is some common ground in all three tests (shaded area),
which is interpreted as showing the presence of the g factor. The dotted
areas show group factors shared by only two of the tests. The unshaded
areas represent either specific factors or unique group factors, or both.

H. This figure represents three tests that have only the general factor
in common. Both tests 2 and 3 have the same amount and the identical area
in common with test 1; hence, they have the same amount and identical area
in common with each other.
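
The overlap patterns just listed can be mimicked with sets of factor
labels. In the sketch below each test is represented by the set of factors
it is assumed to draw upon; the labels (g, a, b, v, n, s1, and so on) and
the two example patterns are hypothetical stand-ins for parts F and G of
Figure 7.1, not data from any analysis.

```python
from itertools import combinations


def describe_overlap(tests):
    """Split factor labels into common-to-all, shared-by-a-pair, and unique."""
    names = list(tests)
    common_to_all = set.intersection(*(tests[n] for n in names))
    shared_by_pair = {}
    for a, b in combinations(names, 2):
        # Factors the two tests share beyond what every test shares.
        shared = (tests[a] & tests[b]) - common_to_all
        if shared:
            shared_by_pair[(a, b)] = shared
    unique = {}
    for name in names:
        elsewhere = set().union(*(tests[o] for o in names if o != name))
        unique[name] = tests[name] - elsewhere
    return common_to_all, shared_by_pair, unique


# Part F: pairwise overlap only, no ground common to all three tests.
pattern_f = {1: {"a", "s1"}, 2: {"a", "b", "s2"}, 3: {"b", "s3"}}
# Part G: ground common to all three, plus factors shared by pairs.
pattern_g = {1: {"g", "v", "s1"}, 2: {"g", "v", "n", "s2"}, 3: {"g", "n", "s3"}}

for label, pattern in (("F", pattern_f), ("G", pattern_g)):
    common, shared, unique = describe_overlap(pattern)
    print(label, sorted(common), shared, unique)
```

The set arithmetic only sorts the labels into the three kinds of area. As
the text stresses, deciding whether the common ground is to be called g or
a broad group factor remains a psychological judgment.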


These graphic illustrations of correlations and factor loadings derived
from statistical analysis serve three purposes: (1) They demonstrate the
complexity of the problem of determining interrelationships of
psychological factors. (2) They demonstrate that the same statistical
findings are often open to more than one psychological interpretation; and,
using the statistical findings as aids, one's interpretation will depend
basically upon his psychological analyses of intellectual functioning.
(3) The illustrations and their several possible interpretations help to
make clear the reasons why the most valid and useful tests within a given
category (for example, intelligence) have much in common regarding
psychological functioning and the test items themselves.
Figures 7.2 and 7.3 illustrate elaborate factorial analyses of
variables that have been statistically fractionated (2). These indicate the
probable proportions of each of the several factors in each of the tests.
These analyses provide insights into the psychological operations that
combine in performance on the tests. Test construction is thereby
facilitated. It should not be assumed, however, that each of these factors
exists or operates independently. We may look at the "slices" in the
Reading comprehension test as an example. We note that verbal comprehension
is the largest single factor; then we have, in order, "mechanical
experience," "reasoning I," and "reasoning II." It is doubtful that these
four factors can, or should be, separated functionally. Mechanical
experience
DISCRIMINATION REACTION TIME


FIG. 7.2. Diagrams of the component variances of three
Army Air Force classification tests. From J. P. Guilford
(2, p. 86). The letters stand for:

V: verbal-comprehension factor
ME: mechanical-experience factor
R1: reasoning I (general-reasoning) factor
R2: reasoning II (common to analogies tests) factor
Vz: visualization factor
O: other common factors, each with variance too small
to mention separately
U: unknown common-factor or specific-factor variances
E: error variances
N: numerical factor
S1: space I (spatial-relations) factor
P: perceptual-speed factor
MB: mathematical-background factor
M2: memory II (visual-memory) factor
PM2: psychomotor II (precision) factor


ons (words and numbers). In this test a 
examinee’s ability to reason with the pro 
NAVIGATOR CRITERION


FIG. 7.3. Diagrams of the component variances of pilot
and navigator training criteria. From J. P. Guilford (2,
p. 86). Letter symbols are as defined with Figure 7.2, except
for some additional ones:

PI: pilot-interest factor
M4: memory IV (content-memory) factor
M3: memory III (picture-symbol association) factor
PM1: psychomotor I (coordination) factor
LE: length-estimation factor


acquires and the degree to which he benefits from his mechanical
experience will depend in part upon his present and potential reasoning
capacity.
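
The idea of component variances can be made concrete with a short
calculation. When factors are taken to be uncorrelated, the share of a
test's variance attributable to a factor is the square of the test's
loading on it; error variance is one minus the reliability; and whatever
remains is unexplained common-factor or specific variance. The loadings and
the reliability below are hypothetical numbers, not Guilford's findings,
and the symbols merely echo the legend of Figure 7.2.

```python
def variance_components(loadings, reliability):
    """Partition unit test variance into factor, unknown, and error shares.

    Assumes uncorrelated factors, so each factor's share is its squared
    loading. "E" is error variance and "U" the unexplained remainder.
    """
    components = {name: loading ** 2 for name, loading in loadings.items()}
    communality = sum(components.values())
    if communality > reliability:
        raise ValueError("communality cannot exceed reliability")
    components["U"] = reliability - communality
    components["E"] = 1.0 - reliability
    return components


# Hypothetical loadings for a reading-comprehension type of test.
hypothetical_loadings = {"V": 0.6, "ME": 0.4, "R1": 0.3, "R2": 0.2}
parts = variance_components(hypothetical_loadings, reliability=0.80)

for name, share in parts.items():
    print(name, round(share, 2))
```

The shares sum to one, which is what allows them to be drawn as segments of
a single diagram. The caution in the text still applies: a tidy partition
of variance does not show that the factors operate independently in the
examinee.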


Implications


The hypotheses as to the nature of mental abilities have been
arrived at by means of several methods of statistical analysis and through
partially different interpretations placed upon similar data by different
investigators. Regardless of which of the hypotheses an author of a test
follows, the instrument he develops will have much in common with those
constructed by authors who base their tests on one of the other hypotheses.

In many respects, the processes of standardization will be the same; the
same basic principles of constructing and testing will have to be
observed. A variety of mental activities will have to be sampled; in the
case of the multifactor theory, in order to sample an adequate and
representative number of the many minute factors; in the case of the
group-factor theory, in order to sample the primary abilities and those
second-order factors that might be found subsequently; in the case of the
two-factor theory, in order to get an adequate sampling of the general
factor.

The main practical differences arising from the theoretical differences
are to be found in the tests based on the group-factor theory, as compared
with others. The differences will be as follows:
1. The parts of the test based on group-factor theory must correspond to the
factors or primaries and they must try to measure these factors in as pure
form as possible.

2. The subtests in a scale based upon group-factor theory should have low
intercorrelations.

3. The test based on group-factor theory will emphasize the separate scores
on each of the primaries and will provide a mental profile, even though
the group-factor test might also provide an over-all index.

4. The Binet type of test, and other g-factor tests, on the other hand,
consists of a variety of test materials that are a composite of abilities,
yielding a mental age and an intelligence quotient. Most group tests,21
while arranging their items according to type (sentence completion,
arithmetical reasoning, word meaning, picture completion, form perception),
are not organized on the basis of specifically defined factors; and, like
the Binet type, they generally yield a single index of relative rank.
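
The contrast between a mental profile and a single index can be shown with
two invented examinees. The factor names and the standard scores below are
hypothetical, and the over-all index is taken, as a simplifying
assumption, to be the plain average of the separate scores.

```python
from statistics import mean


def overall_index(profile):
    """Single over-all index: the unweighted mean of the factor scores."""
    return mean(profile.values())


# Hypothetical standard scores (mean 50) on three group factors.
examinee_a = {"verbal": 60, "numerical": 50, "spatial": 40}
examinee_b = {"verbal": 50, "numerical": 50, "spatial": 50}

for label, profile in (("A", examinee_a), ("B", examinee_b)):
    print(label, profile, overall_index(profile))
```

Both examinees receive the same over-all index, yet their profiles differ
markedly. This is the information a test built on group-factor theory is
designed to preserve and a single index of relative rank conceals.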


Since the several hypotheses regarding the nature of intelligence have
thus far produced relatively few differences in practical test construction
and application, the reader might well ask: "[...] with definitions and
theories, [...] different?" The answer to this question [...] should be
familiar with the [...] background for his better understanding [...]


Creative Ability 


Creativity has long been a subject of interest to psychologists
and others. Within about t[...] "What are the intellectual traits required
for [...] What are the nonintellective personality traits required?"

Numerous introspective reports, written by creative persons in a variety
of fields, are available; but they are not particularly helpful in providing
clues to the measurement of those human abilities that would enable


21 Excepting the "omnibus" type, in which items of various mental
operations are placed in regular or irregular order rather than being
grouped in subtests, each containing items of a single kind.
22 The student will find [...] in connection with specific [...]
psychologists and educators to select the potentially creative artist,
scientist, mathematician, musician, or author. In fact, some creative
individuals are quite unable to explain their mental processes, even at a
descriptive level. One author writes, "I suddenly get an idea for a novel.
I start with some characters; others develop spontaneously. In a short
time, the central characters have taken hold of me, and I live with them
until the book has been finished." 23

Philosophical and psychoanalytical theories of creativity have been ex- 
pounded. One principle that seems to be widespread among these writers 
is that creativity does not occur in a vacuum; it occurs in areas of ex- 
perience, interest, and work to which the person has been “intensively 
committed in his conscious living" (1, p. 62). This trait alone is not suf- 
ficient, though it is necessary. 

Factor analysts, among psychologists, have applied their statistical tech- 
niques to test items, with the result that the factors thus far derived ap- 
pear to be strikingly similar to those found for general ability, productive 
thinking, and scholastic aptitude. At most, all that can be said of the find- 
ings of factor analysis in this elusive and tenuous field of research is that 
they might identify and name some of the mental tools used by the
creative person, but they do not provide insights into how he employs them
to produce his results. Furthermore, it is necessary to differentiate among
creative abilities in the several fields, each of which has some elements
in common with the others; but each also has its own special requirements
and elements. Nor have studies of nonintellective personality traits of
creative persons succeeded in differentiating them from other groups of
superior individuals.

Available psychological tests can only reveal what levels of general in- 
tellectual ability are demanded and what levels of particularized abilities 
(verbal, mathematical, spatial, auditory, visual, immediate and remote re- 
call, etc.) are essential in each of the creative fields. In addition, improved 
personality inventories and projective tests might reveal which traits, if
any, are essential to each field and will differentiate individuals in one 
type of creativity from those in the others. Thus far, these objectives have 
not been achieved. 

Available psychological tests of mental ability have been criticized for
not measuring creative ability. The criticism is unwarranted, because
these tests are not intended to measure it and because the essential nature
of the standardized test does not permit individualized or unique
responses. Tests of general intelligence, however, have made this
contribution: the creative individual is in most cases a person of superior
or gifted general mental capacity as measured by sound tests; but not
everyone who attains the superior or gifted level on tests will prove to be
creative. The principal reason for this might be a matter of
nonintellectual personality traits. Other than this observation and the
reports given by creative persons, educators and psychologists, for the
present, will depend largely upon identifying creative children and
adolescents by their actual productions.


23 Personal communication.

Probably the most extensive psychological studies of creative persons, 
in a variety of fields, have been in progress at the Institute of Personality 
Assessment and Research at the University of California in Berkeley. On
the basis of six years of research, D. W. MacKinnon states: 


It is quite apparent that creative persons have an unusual capacity to
record and retain and have readily available the experiences of their life
history. They are discerning, which is to say that they are observant in a
[...] they are alert, capable of concentrating attention [...]
appropriately; they are fluent in scanning thoughts [...] that serve to
solve the problems they undertake; and, [...] have a wide range of
information at their command. [...] intelligent person, [...]
THE BINET SCALES


Binet's Early Work 


The historical background of the Binet scales has been presented
in Chapter 1. It will be recalled that their development was motivated
by the interests of psychologists in measuring individual differences
in mental abilities. Galton, whose publications are reported in the
first chapter, assumed that simpler measurable sensory capacities would be
significantly correlated with intelligence and that if these simple sensory
measures were obtained, they would afford a means of judging and
predicting an individual's intellectual capacity. Although it has long been
demonstrated that measures of sensory capacities have no value for
evaluation of the higher, complex processes, Galton's work greatly affected
the nature of test experimentation in the United States until about 1900.
At that time the influence of Alfred Binet became apparent.

The relatively simple tests of sensory, motor, and memory capacities
proved to be of little value as measures to reveal intelligence. In the
first place, their intercorrelations were very low, ranging generally from
zero to only .20. And, in the second place, the results of these tests,
when correlated with academic performance, yielded correlation coefficients
of much the same magnitude, many of them being less than .10 (hence useless
for purposes of prediction). As a matter of fact, experimentation in
the years that followed the early investigations has consistently confirmed
the negligible or very low correlations found to exist between sensory and
motor capacities, on the one hand, and the higher more complex functions,
on the other.

It is now generally recognized by psychologists that intelligence has 
little relationship to the elementary sensory and motor processes, and 
but a very moderate relationship indeed to capacity for rote memory (a
correlation of about .30). Many infrahuman animals have keen sensory
discrimination. Mentally deficient children in the higher levels of defect 
and children in the borderline group are not very inferior to normal 
children in respect to skin sensitivity, visual acuity, auditory acuity, and 
reaction time. Nor are intellectually gifted children superior to average 
in these respects. But in the capacities to learn, to organize and direct 
thinking, to adapt behavior, to comprehend problems and deal with
abstractions, in levels of acquired information, in extent of curiosity about
one's environment, these groups differ very markedly. 

The reader should bear in mind these early kinds of tests, not only
for historical purposes, but also in order to compare the early efforts 
with currently available tests and to be more clearly aware of the direction 
in which psychological testing has been moving. 

The early work of Binet was along the same lines as that of the
American and German psychologists mentioned in Chapter 1. He used tests
of tactual discrimination, reaction time, visual discrimination, auditory 
discrimination of time intervals, reproducing letters and numbers from 


FIG. 8.1. Alfred Binet (1857-1911)

memory, and so on. But though he experimented with these materials
until about 1900, some years earlier he had begun to doubt the value of 
continuing with them. 

Although some of the mental activities that Binet proposed to measure, 
and with which he was experimenting, were as yet vague, he did never- 
theless point out the direction in which mental tests should and in fact 
did develop. 

Binet and his collaborators objected to the kinds of psychological tests
that followed Galton's work, on the ground that they were too simple
in character and would contribute little to the understanding of
differences among persons in respect to the higher mental functions. Binet
[...] that intelligence is expressed not in the form [...], but rather as a
combined mental operation [...] processes are involved operate as a unified
whole. [...] functions that individual differences are most marked; it is
these that distinguish individuals most significantly and [...] already
discussed in [...] Standard scores [...]

The thirty items, in order of increasing difficulty, included in the 1905
scale are as follows:
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1. Visual coordination—degree of coordination of movement of head 
and eyes as a lighted match is slowly moved before subject's eyes. 

?. Prehension provoked by tactual stimulus—a small cube of wood is 
placed on back or palm of the subject's hand to see if he grasps it and carries 
it to his mouth, and coordination of movements is to be noted. 

3. Prehension provoked visu:illy—cube of wood is placed within subject's 
reach by examiner who notes whether subject grasps it. 

4. Recognition of food—a small piece of chocolate and a piece of wood 
of same dimensions are shown successively, and signs of recognition of food 
and efforts to take it are noted. 

5. Seeking food when slight mechanical difficulty is interposed—a small 
piece of chocolate, wrapped in a piece of paper, is given to the subject, and 
his manner of separating the food from the paper is noted. 

6. Execution of simple directions and imitations of simple gestures. 

7. Verbal knowledge of objects: parts of the body (head, ear, nose, etc.)
are indicated by the subject, and common objects (key, string, cup) are
handed to examiner on request.

8. Verbal knowledge of objects in a picture, as shown by pointing out
objects, the names of which are given.

9. Naming of objects designated in a picture.

10. Comparison of the lengths of two straight lines, pointing out the
longer.

11. Repeating three digits immediately after hearing the series once.

12. Comparison of weights; identical-appearing blocks of wood weighing
3 and 12 grams, 6 and 15 grams, 3 and 15 grams.

13. Suggestibility: asking the subject for an object that is not present
(modification of 7); asking subject to point to a nonexistent object in a
picture, designated by a nonsensical word (modification of 8); comparison
of lines of equal length (modification of 10).

14. Definitions of familiar objects, such as house, horse, fork.

15. Repetition of sentences having fifteen words each, after hearing each
one only once.

16. Giving the differences between two common objects; for example,
wood and glass, a fly and a butterfly.

17. Immediate recall of pictures of familiar objects: pictures of thirteen
common objects are shown for thirty seconds, after which the subject names
as many as he can recall.

18. Drawing from memory two different geometric designs which have
been shown simultaneously for ten seconds.

19. Repetition of series of digits, beginning with a series of three and
proceeding until the subject's limit is reached.

20. Giving resemblance between common objects; for example, a wild
poppy and blood; an ant, a fly, a butterfly, and a flea.

21. Rapid comparison of lengths of lines: a line of 30 cm. is compared
with fifteen others varying from 31 to 35 cm.; then a line of 100 cm. is com-
pared with twelve others varying from 101 to 103 cm.

22. Discriminating and arranging in order five weights (3, 6, 9, 12, 15
grams), all being of equal size.

23. One of the weights in test 22 is removed, the re-
maining weights are scrambled, and the subject is asked to identify the
missing weight or gap in the series.

24. Giving rhymes to selected words. 

25. Sentence completion—supplying the correct word to complete a sen- 
tence. 


26. Devising a sentence to include three given words; for example, Paris,
gutter, fortune.

27. Comprehending and giving replies to twenty-five problem questions 
graded in difficulty—What is the thing to do when you are sleepy? Why is 
it better to continue with perseverance what one has started than to abandon 
it and start something else? 

28. Reversing the hands of a clock, to be done from memory; for example, 
giving the time it would be if the large and the small hands were inter- 
changed at four minutes to three. The subjects who succeed are given the 
more difficult problem of explaining why the precise transposition indicated 
is impossible. 

29. Drawing lines to show the folds and cutout of a piece of paper that has 
been quarto-folded and from which a triangular piece has been cut. 


30. Giving definitions and distinctions between paired abstract terms; for 
example, sad and bored. 


Although this set of tests was not separated into age groups, Binet did
indicate several differentiating levels. Question number 6 was the upper
limit of idiots (adult); question 9 was the upper limit of ordinary 3-
year-old children; number 14 was the limit of ordinary 5-year-old chil-
dren; number 16, that of imbeciles (adult); test 23, the most probable
limit of morons (adult), although test 27 was regarded as having great
value in revealing the moron. In addition, the authors reported a number
of qualitative and quantitative differences in replies to many of the ques-
tions, thus distinguishing between 7- and 9-year levels, on the one hand,
and 9- and 11-year levels, on the other.
The order of tests in the 1905 scale was experimentally determined,
for it was established after being used with children in the primary
[...] for the mentally deficient (the Salpêtrière).
The children in primary school were regarded as normal on the basis of
the fact that they were in grades just right for their ages, neither ad-
vanced nor retarded. Binet and Simon report that many such children
were tested, but norms for the scale were based upon records of only
ten cases in each of the following age groups: 3, 5, 7, 9, and 11 years.
[...] admittedly rather crude and tentative [...] and morons in a more
objective manner than had been possible before.


Furthermore, in the foregoing list of thirty items, the reader will find 
many types which have since been developed, standardized, and in- 
cluded in a large number of current psychological tests, from those de- 
signed for babies to those intended for adult levels. 

It is significant to note, also, that while Binet wanted to devise a scale 
that would yield age ratings, he was equally concerned with the quality 
of judgment and reasoning shown by the subject in the course of the 
examination. Binet was thus using the test situation as an opportunity 
for a clinical interview—a practice which is becoming increasingly wide- 
spread and of increasing importance in reports of psychological examina- 
tions by present-day clinical psychologists. 

The 1908 Binet-Simon Scale. Binet and Simon recognized the 
defects of the first scale. They recognized that an improved scale would 
have to provide more valid norms, based upon a larger and more repre- 
sentative sampling of children at each age; that tests for each age within 
the limits of the scale would have to be included to achieve finer units 
of measurement and greater accuracy. Their own subsequent investiga- 
tions and those of other psychologists resulted in a new form of the test, 
known as the 1908 scale, in which the items are grouped at the appro-
priate age levels, from 3 years to 13 years (2, 3).


Age 3
1. Points to nose, eyes, mouth.
2. Repeats sentences of six syllables.
3. Repeats two digits.
4. Enumerates objects in a picture.
5. Gives family name.

Age 4
1. Knows own sex.
2. Names certain familiar objects that are shown to him (key, knife, penny).
3. Repeats three digits.
4. Perceives which is the longer of two lines 5 and 6 cm. in length.

Age 5
1. Indicates the heavier of two cubes (3 and 12 grams; 6 and 15 grams).
2. Copies a square.
3. Constructs a rectangle from two triangular pieces of cardboard, having a
model to look at.
4. Counts four coins.
5. Repeats a sentence of ten syllables.

Age 6
1. Knows right and left; indicated by showing right hand and left ear.
2. Repeats sentence of sixteen syllables.
3. Chooses the prettier in each of three pairs of faces (esthetic comparison).
4. Defines familiar objects in terms of use.
5. Executes three commissions.
6. Knows own age.
7. Knows morning and afternoon.

Age 7
1. Perceives what is missing in unfinished pictures.
2. Knows number of fingers on each hand and on both hands without
counting.
3. Copies a written model ("The little Paul").
4. Copies a diamond.
5. Describes presented pictures.
6. Repeats five digits.
7. Counts thirteen coins.
8. Identifies by name four common coins.

Age 8
1. Reads a passage and remembers two items.
2. Adds up the value of five coins.
3. Names four colors: red, yellow, blue, green.
4. Counts backwards from twenty to zero.
5. Writes short sentence from dictation.
6. Gives differences between two objects.

Age 9
1. Knows the date: day of week, day of month, month of year.
2. Recites days of week.
3. [...]
4. [...] superior to use; familiar objects are employed.
5. Reads a passage and remembers six items.
6. Arranges five equal-appearing cubes in order of weight.

Age 10
1. Names the months of the year in correct order.
2. Recognizes and names nine coins.
3. Constructs a sentence in which three given words are used (Paris, fortune,
gutter).
4. [...]
5. [...] question between ages 10 and 11. Only about one half of the 10-
year-olds got the majority of these cor[...]

Age 11
1. Points out absurdities in statements.
2. Constructs a sentence, including three given words (same as number 3
in age 10).
3. Gives any sixty words in three minutes.
4. Defines abstract words (charity, justice, kindness).
5. Arranges scrambled words into a meaningful sentence.


Age 12 

1. Repeats seven digits. 

2. Gives three rhymes to a word (in one minute). 

3. Repeats a sentence of twenty-six syllables. 

4. Answers problem questions. 

5. Interprets pictures (as contrasted with simple description). 

Age 13 
1. Draws the design made by cutting a triangular piece from the once-folded
edge of a quarto-folded piece of paper.
2. Rearranges in imagination the relationship of two reversed triangles and
draws results.
3. Gives differences between a pair of abstract terms: pride and pretension.

There are several obvious differences between the 1905 scale and that
of 1908. In the former, there are thirty test items; in the latter, fifty-nine.
The latter does not include the first six items of the 1905 scale, which
are at the infant level; some other items of the 1905 scale have been
eliminated, and many new ones have been added. As compared with the
1905 scale, the age range extends higher in the 1908 scale. There are
specific groups of items for each age (thus permitting a more accurate
rating of individuals), and a greater variety of mental processes is tested.

In the 1908 scale, there are also two new and significant contributions
to the theory and practice of mental testing and test construction: (1)
the tests, after experimentation, were standardized by being grouped
into appropriate age levels (Binet's method is explained below); (2) the
concept of mental age is employed for the first time.*

* Although mental age is employed here for the first time, the concept itself had been
proposed by Binet in 1905.

The principal criterion employed by Binet and Simon in the stand-
ardization and age placement of tests was this: in general, a test was
placed at the year level where it was passed satisfactorily by two thirds
to three fourths of a representative group of children of that age. The
ideal standard was to place a test at a year level where it was passed by
seventy-five percent of that age group. Binet's reason for setting this
ideal criterion is a sound one and is made clear by reference to a sym-
metrical bell-shaped curve, which is approximated by most distributions
of intelligence-test scores. The middle fifty percent of the group are most
nearly alike, most nearly homogeneous in respect to the abilities being
measured, as is obvious from the concentration of these fifty percent
within a relatively narrow range, or variation, of scores. Otherwise stated,
those individuals constituting the middle fifty percent of the distribution
are the typical persons of the age group; hence, their test performance
should be regarded as typical or normal for their age. If the middle fifty
percent of a given age are able to pass a test, then that same test can be
passed by the twenty-five percent who are above the middle group in
ability, making a total of seventy-five percent who are able to pass the test.
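
The placement rule just described can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name, the data layout, and the sample percentages are hypothetical and do not come from Binet and Simon; only the two-thirds lower bound and the seventy-five percent ideal are taken from the text.

```python
# Hypothetical sketch of the age-placement rule described in the text.
# pass_rates maps an age in years to the fraction of that age group passing.

def place_test(pass_rates, lower=2 / 3):
    """Return the youngest age at which at least `lower` of the group passes.

    Returns None when no age group reaches the lower bound.
    """
    for age in sorted(pass_rates):
        if pass_rates[age] >= lower:
            return age
    return None


def meets_ideal(pass_rates, age, ideal=0.75):
    """True when the pass rate at `age` reaches the ideal of 75 percent."""
    return pass_rates[age] >= ideal


# Invented figures for a single test item, for illustration only.
example = {5: 0.46, 6: 0.70, 7: 0.93}
placed_at = place_test(example)          # 6: first age at or above two thirds
ideal_met = meets_ideal(example, placed_at)  # False: 70 percent is below 75
```

In this invented case the item would be placed at year 6 because it approximates, without exactly meeting, the seventy-five percent standard.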

In actual experience, however, it has been practically impossible to 
devise tests that will exactly satisfy this criterion of seventy-five percent 
passing. Fortunately, there are other criteria of validity that are also of 
primary significance, so that tests are retained if they approximate the 
seventy-five percent criterion, and if they demonstrate their value by also 
satisfying other demands, such as distinguishing between groups of individuals
of known ability (mentally deficient, average, and superior),
showing appreciable or significant differences between percentages pass- 
ing at successive age levels, and correlating fairly well with scholastic 
achievement. These aspects of validation are more fully discussed in the 
following chapter, in connection with revisions of the Binet scale. 

Binet and Simon standardized their 1908 scale after individual ex- 
aminations of 203 Paris school children between the ages of 3 and 13 
years. Although this number is small and would be regarded as inade- 
quate in present-day test-standardization procedures, the fact is that these 
French pioneers did set a pattern of standardization that is being fol- 
lowed today, with considerable statistical refinement. For in addition to 
having suggested the criteria already mentioned, they also, in effect, used 
the symmetrical bell-shaped curve as a criterion, though without offering 
precise numerical values. They stated, simply, that the number of chil- 
dren testing above age (superior) should equal the number testing below 
age (inferior), and the number testing at age, or normal, should be greater 
than the number who rank as either superior or inferior. 
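
The informal distribution check stated above (as many children testing above age as below, with the at-age group the largest) can be sketched as follows. The function names and the `tolerance` parameter are assumptions introduced for illustration; Binet and Simon gave no numerical values.

```python
# Hypothetical sketch of the balance check on one age group.

def age_balance(mental_ages, chronological_age):
    """Count children testing above, at, and below their chronological age."""
    above = sum(1 for ma in mental_ages if ma > chronological_age)
    below = sum(1 for ma in mental_ages if ma < chronological_age)
    at_age = len(mental_ages) - above - below
    return {"above": above, "at": at_age, "below": below}


def fits_pattern(counts, tolerance=0):
    """True when above and below are balanced and the at-age group is largest.

    `tolerance` is an assumed allowance for small samples, not Binet's.
    """
    balanced = abs(counts["above"] - counts["below"]) <= tolerance
    largest = counts["at"] > max(counts["above"], counts["below"])
    return balanced and largest


# Invented mental ages for eight 7-year-olds.
counts = age_balance([6, 6, 7, 7, 7, 7, 8, 8], chronological_age=7)
```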

The mental age, with the 1908 scale, was found as follows: first, the 
subject was credited with the age level at which he passed all tests. To 
this basic level (now called the “basal year") an additional year's credit 

was added for every five tests passed at higher levels. The total was the 
subject's mental age. No credits were given for a fraction of a year; but 
in the 1911 scale (see below) the calculation of mental age was modified 
so as to include fractional parts. The reader will note that this method 
of deriving mental age is essentially the same as that used with the Amer- 
ican age-scale revisions. 
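
As a minimal sketch of the 1908 scoring rule just described, assuming the examiner has already found the basal year and counted the tests passed at higher levels (the function and argument names are hypothetical):

```python
# Sketch of the 1908 rule: basal year plus one year per five tests passed
# above it; fractions of a year receive no credit.

def mental_age_1908(basal_year, passed_above_basal):
    """Return the mental age in whole years under the 1908 scoring rule."""
    return basal_year + passed_above_basal // 5


# A child with a basal year of 7 who passes six tests at higher levels
# is credited with one additional year.
ma = mental_age_1908(7, 6)  # 8
```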

In spite of its imperfect standardization, in the 1908 Binet-Simon scale 
and in the publications concerned with it will be found many of the 
important concepts and practices which have been employed since then 
in the construction and use of psychological tests. 


The 1911 Revision of the Binet Scale. The 1908 scale created
considerable interest among psychologists in Belgium, Germany, England,
Italy, Switzerland, and the United States. Their interest resulted in a
number of valuable applications and evaluations of the Binet scale, ac-
companied by suggestions for revisions.

For the most part, criticisms and suggestions dealt with the age levels 
at which various items had been placed. It is not surprising that, in the 
first age scale devised to measure intelligence, further and extensive 
applications and analysis of results should have revealed that a number 
of the items were misplaced. The principal criticism was that the tests 
at the lower age levels were too easy, whereas those at the higher levels 
were too difficult, with the result that the former group were rated too 
high, while the latter were rated too low. In other words, standardization 
of the test had to be improved. Binet utilized the suggestions and criti- 
cisms of other psychologists, as well as the results of his own continued 
researches on the 1908 scale, the result being the 1911 revision. 

Specifically, the major changes incorporated in the 1911 scale were the 
following: four of the tests at the 11-year level were raised to the 12-year 
level; all 12-year tests were raised to the 15-year level; the three tests of 
year 13, plus two new ones, constituted the new adult level. Here and 
there, also, a few tests were placed in either a higher or a lower age level. 
No tests were provided for the 11-, 13-, and 14-year levels.* In addition
to these changes, several tests found in the 1908 scale were omitted from
the 1911 scale because they seemed to depend too much on school learn-
ing or on incidental information.

* Inasmuch as the rate of mental development appears to decrease appreciably after
age 10, it becomes difficult to devise tests that will distinguish adequately between yearly
levels. This difficulty was encountered also by the authors of the first Stanford revision of
the Binet-Simon scale (1916); but in the second [...] levels between ages 10 and 14
(see Chapters 9 and 10).

At age levels 3, 4, and 5, the tests are the same as in the 1908 version.

Age 6
1. Distinguishes between morning and afternoon.
2. Defines names of familiar objects in terms of use.
3. Copies a diamond.
4. Counts thirteen sous.
5. Distinguishes between pictures of ugly and pretty faces.

Age 7
1. Shows right hand and left ear.
2. Gives description of pictures.
3. Executes three commissions given simultaneously.
4. Gives value of three single and three double sous.
5. Names four colors: red, green, yellow, blue.

Age 8
1. Gives differences between two objects (from memory).
2. Counts backward from twenty to zero.
3. States omissions from unfinished pictures.
4. Knows the date.
5. Repeats five digits.
Age 9
1. Makes change from twenty sous.
2. Defines names of familiar objects in terms superior to use.
3. Recognizes all nine [French] coins.
4. Gives months of the year in correct order.
5. Comprehends and answers easy problem questions.

Age 10
1. Arranges five blocks in order of weight.
2. Reproduces two geometric designs from memory.
3. Criticizes absurd statements.
4. Comprehends and answers difficult problem questions.
5. Uses three given words in two sentences.


The method of scoring the 1911 scale was modified so that fractions of 
a year could be used in determining the mental age. Since there were 
five tests at each age level (except at age 4), each counted as two tenths 
of a year. Thus, if a child passed all tests at age 6, two at age 7, and one
at age 8, his mental age would be 6.6 years. 
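
The 1911 fractional scoring can be sketched in the same way. The function name is hypothetical, and the sketch assumes five tests at every level, which the text notes was not true at age 4.

```python
# Sketch of the 1911 rule: each test passed above the basal year counts
# as two tenths of a year (assuming five tests per age level).

def mental_age_1911(basal_year, passed_above_basal):
    """Return the mental age in years, to one decimal place."""
    # Work in tenths of a year to avoid floating-point accumulation.
    tenths = basal_year * 10 + 2 * passed_above_basal
    return tenths / 10


# The example in the text: all tests at age 6, two at age 7, one at age 8.
ma = mental_age_1911(6, 2 + 1)  # 6.6
```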

According to Binet, a child whose mental age is equal to his chrono- 
logical age is considered "regular" in intelligence; one whose mental age 
is higher is called "advanced"; and one whose mental age is lower is 
called "retarded." The degree of advancement or retardation in any in- 
stance is dependent upon the extent of the difference. Within about a 
year, however, William Stern was to suggest the use of the intelligence 
quotient, which has since been widely employed to indicate degree of 
acceleration or retardation in intelligence. 

Binet's scales of 1908 and 1911 provided the stimulation and the basis 
for several adaptations and revisions in the United States. The authors 
of American revisions utilized Binet's principles and drew freely on his 
tests, as well as adding new ones and standardizing their instruments 


specifically for American children. These revisions are presented in the 
following chapter. 


Summary 


At this point we may very briefly summarize Binet's major
contributions to the theory and practice of intelligence testing.

1. If a psychologist is to develop a test of intelligence, he should first
formulate a working conception and definition of intelligence, and then 
proceed experimentally. As a result of experimentation, new hypotheses 
will be developed; these in turn will influence later test construction; 
thus, both the conception and measurement of intelligence will undergo 
improvement and refinement. Binet's own conception of intelligence in- 
cluded mainly the following characteristics: ability to reason and judge 
well, to comprehend well, to take and maintain a definite direction of 
thought, to adapt thinking to the attainment of a desirable end, and to 
be autocritical. 

2. Intelligence must be measured by testing the higher, complex mental 
processes rather than the relatively simple sensory and motor activities. 

3. Intelligence, being a complex, can be tested only by the use of a 
diversity of materials devised to evaluate the operations of mental proc- 
esses as an integrated unit, rather than by measuring the separate ele- 
ments that might contribute to the complex functioning of intelligence. 
Though the Binet tests seem to be simple in conception and construction, 
they actually involve many complex mental activities: memory of several 
kinds, apperception, free association, orientation in time, language com-
prehension, ability with numbers, knowledge about common objects,
constructive imagination, comparison of concepts, perception of contra- 
dictions, understanding of abstract terms, ability to meet novel situations, 
and combining fragments into a meaningful whole. 

4. The tests included must be appropriate to the environment of those
for whom they are intended.

5. The tests were arranged in the form of a scale, from easiest to most 
difficult, and groups of tests were placed at appropriate age levels. The 
criterion was, ideally, that a test should be placed at a level where it 
was passed by three fourths of that age group. 

6. The concept of mental age was introduced. 

7. The tests must be so standardized that the large middle group of 
average children (in the curve of distribution) will test “at age.” 

8. Other criteria of validity were introduced, such as known groups, 
scholastic ratings, and increase in percentage passing a test at successive 
age levels. 

9. The need of establishing the reliability of a test was recognized; 
Binet, therefore, made a few reliability studies with his 1911 scale. 

Not only did Binet make these contributions; he also indicated the 
extensive uses to which psychological tests could be put in educational, 
social, vocational, and theoretical problems, for he regarded tests as tools 
for research and for scientific solution of important practical problems. 
Indeed, many of the researches and uses to which tests have since been
applied are definitely along lines indicated by Binet, including, among
others, the testing of prospective soldiers in order to eliminate the men-
tally unfit.

Binet did not regard his tests as final or as quite satisfactory; he did
not claim that they measured all aspects of personality; he emphasized
that they must be supplemented by psychological and educational infor-
mation derived by other means and from other sources. He did claim,
and in this he has been supported by extensive subsequent use, that his
test, and improved versions that should follow, can provide a very useful
and reasonably valid index of an individual's general intelligence, when
the tests are administered and interpreted by qualified examiners.


EARLY REVISIONS OF THE
BINET-SIMON SCALE


This chapter will be devoted principally to the 1916 version of 
the Stanford-Binet scale. Although it was the first of three Stanford scales 
(1916, 1937, and 1960), students, particularly at the graduate level, and 
those who will use tests or who must be familiar with them in their 
occupations should be knowledgeable regarding earlier efforts. They will
thereby be able to understand more fully the development and progress 
made over the years. They will have more information regarding the 
origins of current techniques and procedures; they will be better able to 
evaluate progress made in current tests. 


Four Early Revisions 


The two most widely known and used adaptations of the Binet
scale in the United States were the Stanford revisions of 1916 and 1937.
There were, however, four other revisions that, at one time or another,
were of value to psychologists but which today are not employed and
are chiefly of historical interest. In 1908, H. H. Goddard published a
translation of Binet's 1905 scale; and in 1911 he produced, for use in the
United States, a revision of Binet's 1908 version. Yerkes published revi-
sions in 1915 and 1923, in which the several types of items were grouped
as subtests in a point scale (for example, memory span for digits, analogies)
instead of being placed at age levels. Herring's revision appeared in 1922
and for some years was used as a valuable alternate in place of the 1916
Stanford scale. Kuhlmann's three revisions (1912, 1922, and 1939) were
extensive and elaborate in respect to standardization, scoring, and age
range covered. Thus, it is clear that a significant amount of psychological
work had been done prior to the publication of the 1916 Stanford scale
and that several investigators continued their research and improvements
on Binet's instrument for some years afterwards.


The Stanford Revision of 1916 


The full name of this test, The Stanford Revision of the Binet- 
Simon Intelligence Scale, is derived from the fact that the revision was 
made at Stanford University, under the direction of L. M. Terman. The 
construction of this scale was undertaken for the purpose of providing 
an instrument that would be adequately standardized and adapted for 
use in the United States. Its acceptance by psychologists and educators 
is attested by the fact that it was the most widely used individual scale
until the revised Stanford-Binet appeared in 1937.

Although Terman and his collaborators examined approximately 2300 
subjects—1700 normal children, 200 defective and superior children, and 
400 adults—over a period of several years, the revision of the scale below 
the 14-year level was actually based upon the results obtained with about 
1000 native-born children in California. Each one of these children, 
representing an unselected group of average social status, was within 
two months of his birthday. 

The 1916 scale includes 90 test items, covering an age range from
3 years to 14 years, with a group of test items added at the "average adult"
level and another at the "superior adult" level. Of these 90 test items,
54 were adapted from the 1911 Binet scale, 5 from earlier Binet scales,
4 from other American tests, and 27 were new additions.

VALIDATION. The process of selecting the items involved (1) the com-
[...] That is, a correct scale must cause [...] test exactly at 5 (MA),
the average [...] (5, p. 53). Or, in terms of the intelligence [...]
scale, an unselected group of children at each age should yield a median
of 100.

Before the desired results were secured and this criterion satisfied, it
was necessary to prepare three revisions of the scale. This involved the
elimination of some test items, the shifting of others up or down in age
level, and changes in scoring standards. "As finally revised," Terman
states, "the scale gives a median intelligence quotient closely approx-
imating 100 for the unselected children of each age from 4 to 14."


TABLE 9.1

PERCENTS PASSING TESTS LOCATED IN YEAR VI:
1916 REVISION

                                       Ages
Test                           4     5     6     7     8
Right and left                40    50    71    86    95
Mutilated pictures            27    50    65    87    96
Counting 13 pennies           30    46    76    93    86
Comprehension                 25    55    70    86    93
Naming 4 coins                95    47    74    91    95
Repeating 16-18 syllables     34    56    69    90    95

Source: L. M. Terman et al. (6, pp. 167-168). By permission.


The test items above the age of 14 were based on examinations of
30 businessmen, 150 migrating unemployed men, 150 adolescent delin-
quents and 50 high school students. These groups are not a representative
cross section of persons above 14 years of age in the general population.
This fact will help make clear why the 1916 scale was found to be un-
satisfactory for use with older adolescents and with adults.** The unsatis-
factory quality of the scale at the upper ages was due also to inadequate
sampling of abilities.

* A theoretical or ideal percentage of passes for the placement of a test of a given age
level was not used. Terman states, "We had already become convinced ... that no satis-
factory revision of the Binet scale was possible on any theoretical considerations as to
the percentage of passes which an individual test ought to show in a given year to be
considered standard for that year" (5, p. 54). Accordingly, a "trial-and-success" method
was used in order to get the desired median mental age and IQ at each chronological
age level. The same practice was followed in standardizing the 1937 revision.

** In a personal communication Dr. Terman states that there were three tentative
versions of the scale before the final one was published. The businessmen and high
school students were used in making the first tentative placement of tests at average and
superior adult levels. The other adult groups were then used in subsequent rearrange-
ment of test items.

In addition to the criterion of a significant increase in the percent
passing a test item at successive ages, the criteria which follow were used
in establishing validity of the 1916 scale.

First, in each age group, all the subjects tested were divided into the
following three classes: those testing below 90 IQ, those testing between
90 and 109, and those testing 110 or above. Each test item was then
examined to determine whether it was passed by a "decidedly higher"
percentage of individuals in the superior IQ group than in the inferior.
(The term "decidedly higher" was not defined by Terman.) Only those
test items that satisfied this criterion were retained. The data shown in
Table 9.2 are illustrative.
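
The item-retention criterion can be sketched as a simple comparison of percents passing in the inferior and superior IQ groups. Because Terman did not define "decidedly higher," the 20-point threshold below is purely an assumption chosen for illustration, as are the function and variable names; the percents are the lowest-group and highest-group figures reported in Table 9.2.

```python
# Hypothetical sketch: retain an item only if the superior group passes it
# at a "decidedly higher" rate than the inferior group.
# MIN_GAP is an assumed threshold; the source gives no numerical value.

MIN_GAP = 20  # percentage points (assumption)


def retains_item(pct_inferior, pct_superior, min_gap=MIN_GAP):
    """True when the superior group's percent passing exceeds the inferior
    group's by at least `min_gap` percentage points."""
    return (pct_superior - pct_inferior) >= min_gap


# (percent passing in lowest IQ group, percent passing in highest IQ group)
table_9_2 = {
    "Counting 13 pennies": (40, 96),
    "Describing pictures": (48, 80),
    "Giving similarities": (44, 83),
    "Making change": (39, 73),
    "Comprehension of problem situations": (25, 76),
}

retained = {name: retains_item(low, high) for name, (low, high) in table_9_2.items()}
```

Under this assumed threshold every illustrative item in the table would be retained; a stricter or looser threshold would be an equally defensible reading of Terman's wording.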


TABLE 9.2

PERCENTS PASSING CERTAIN TESTS,
CHRONOLOGICAL AGE CONSTANT

                                                       IQ
Age  Test                                  Below 96   96-105   Above 105
 6   Counting 13 pennies                      40        77        96
 7   Describing pictures                      48        52        80
 8   Giving similarities                      44        57        83
 9   Making change                            39        60        73
10   Comprehension of problem situations      25        64        76

Source: L. M. Terman et al. (6, p. 133). By permission.


Second, after the scale had been developed, the IQs obtained with
504 school children were compared with their scholastic ratings, as graded
by their teachers, on a five-point scale; namely, very inferior, inferior,
average, superior, very superior. Moderate agreement was found between
intelligence quotients and school ratings, the coefficient of correlation
being .48, close enough so that Terman [...] there was no justifiable
"serious suspicion [...] intelligence scale."

RELIABILITY. Following its publication, the Stanford-Binet was sub-
jected to numerous studies in order to determine its reliability by the
method of self-correlation. The correlation coefficients, which in such
studies will vary with the size and constitution of the experimental
group, ranged from about .80 to .95. Such coefficients are regarded as
highly satisfactory indexes of reliability.

The Scale. The reader will have noted that no essentially 
new concepts or principles have been added in the 1916 Stanford-Binet 
scale, as compared with Binet's own. Terman and his colleagues did, 
however, extend, refine, and adapt the Binet scales, so that the 1916 
revision was a better standardized, hence more valid and reliable, instru- 
ment. 

The complete list of tests of the 1916 Stanford-Binet follows.* Through-
out the scale, those items designated "Al.," instead of being numbered,
are alternates, to be used in place of one of the numbered items, where
the examiner, for any reason, believes a numbered item to be inappro-
priate.

* Reproduced by permission of Houghton Mifflin Company.

Age 3
1. Points to parts of body.
2. Names familiar objects.
3. Enumerates objects in pictures.
4. Gives sex.
5. Gives last name.
6. Repeats six to seven syllables.
Al. Repeats three digits.

Age 4
1. Compares lengths of lines.
2. Discriminates between geometric forms.
3. Counts four pennies.
4. Copies a square.
5. Comprehends and solves problem situations.
6. Repeats four digits.
Al. Repeats twelve to thirteen syllables.

Age 5
1. Compares weights.
2. Names familiar colors.
3. Makes esthetic comparisons of paired drawings of faces.
4. Defines common words: use or better.
5. Puts together a divided triangle.
6. Carries out three commissions.
Al. Gives own age.

Age 6
1. Knows right from left.
2. Perceives missing parts in pictures.
3. Counts thirteen pennies.
4. Comprehends and solves problem situations.
5. Identifies coins.
6. Repeats sixteen to eighteen syllables.
Al. Knows morning from afternoon.

Age 7
1. Knows number of fingers on each and both hands.
2. Describes pictures.
3. Repeats five digits.
4. Ties a bowknot.
5. Gives differences between paired objects.
6. Copies a diamond.
Al.1. Names days of week in correct order.
Al.2. Repeats three digits backwards.

Age 8 
1. Traces path to be followed in a systematic search for a lost object in a 
field. 
2. Counts backward from twenty to one. 
3. Comprehends and solves problem situations. 
4. Gives similarities between two things. 
5. Defines names of objects in terms superior to use. 
6. Defines twenty words from a vocabulary list. 
Al.1. Identifies six coins. 
Al.2. Writes short sentence from dictation. 


Age 9 
1. Gives date: day of week, month, day of month, year. 
2. Discriminates between weights: 3, 6, 9, 12, 15 grams. 
3. Makes change in small amounts. 
4. Repeats four digits backwards. 
5. Makes up a sentence including three given words. 
6. Gives rhymes to three words. 
Al.1. Names the months of the year. 
Al.2. Gives total value of a group of one-cent and two-cent postage stamps. 


Age 10 
1. Defines thirty words from vocabulary list. 
2. Detects absurdities in statements. 
3. Reproduces two designs from memory. 
4. Reads a short passage and reproduces content. 
5. Comprehends and solves problem situations. 
6. Names any sixty words by free association. 
Al.1. Repeats six digits. 
Al.2. Repeats twenty to twenty-two syllables. 
Al.3. Fits rectangular blocks into formboard. 


Age 12 
1. Defines forty words from vocabulary list. 
2. Defines abstract words. 
3. Traces a path in systematic search (same problem as in year 8, but a 
superior plan is required here). 
4. Rearranges dissected sentences into meaningful sentences. 
5. Interprets fables. 
6. Repeats five digits backwards. 
7. Interprets pictures. 
8. Gives similarities between three things. 


Age 14 
1. Defines fifty words from vocabulary list. 
2. Discovers a rule in a paper-folding test (induction test). 
3. Gives differences between a president and a king. 
4. Integrates given facts and arrives at a conclusion concerning them. 
5. Solves arithmetical reasoning problems. 
6. Reverses hands of clock, in imagination, and gives the hour. 
Al. Repeats seven digits. 
Average Adult 
1. Defines sixty-five words from vocabulary list. 
2. Interprets fables. 
3. Gives differences between abstract words. 
4. Solves problem of number of enclosed boxes (boxes within boxes) when 
shown only the large outside box. 
5. Repeats six digits backwards. 
6. Perceives the pattern of a code and uses it. 
Al.1. Repeats twenty-eight syllables. 
Al.2. Comprehends problems involving physical relations. 
Superior Adult 
1. Defines seventy-five words from vocabulary list. 
2. Visualizes, imaginally, and draws appearance of a folded and cut piece 
of paper. 
3. Repeats eight digits. 
4. Repeats thought of a passage heard. 
5. Repeats seven digits backwards. 
6. Solves problems involving “ingenuity.” 


The Scoring Method. Each age level from 3 years through 10, 
it will be noted, has six test items (plus the alternate which may replace 
one of the six). Each of these carries credit of two months, so that the 
tests in each of the age levels provide a year's increment in mental age. 

There are no tests at the 11-year level, the reason being that the authors 
of the scale were, apparently, unable to devise tests that would indicate 
a one-year difference at this stage of mental development. This gap in 
the scale, it is believed, is due to the slowing down of mental develop- 
ment, thus decreasing the annual increments and making it more difficult 
to measure those increments by means of the then-available test items. 
Since the eight test items at age 12 cover a span of two years, each one 
carries a credit of three months in order to yield an average mental age . . . 


ing the standardization of the 1916 revision, in explaining the limit for 
average adult, Terman states that his data on mental ages of 62 adults, 
including 30 businessmen and 32 high school pupils, who were over 
16 years of age, show “. . . that the middle section of the graph [of the 
distribution] represents the ‘mental ages’ falling between 15 and 17. This 
. . . designated as the ‘average adult’ level” (5, p. 55). 
Mental ages above 17 were designated as “superior adults,” the possible 
maximum mental age on this scale being 19.5. (Six tests, six months’ 
credit each, added to the maximum of 16.5 
attainable at the average adult level.) 
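
The credit arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not the procedure of the manual: the function and variable names are hypothetical, and the credits of four months per item at age 14 and five months per item at the average adult level are not stated in the passage above; they are inferred from the stated maximum of 16.5 years at the average adult level. The credits of two, three, and six months are those given in the text.

```python
# Months of credit per passed item at each level of the 1916 scale.
# 2 months (ages 3-10), 3 months (age 12) and 6 months (superior adult)
# come from the text; 4 months (age 14) and 5 months (average adult, "AA")
# are ASSUMED here, inferred from the stated 16.5-year maximum at "AA".
CREDIT_MONTHS = {**{year: 2 for year in range(3, 11)},
                 12: 3, 14: 4, "AA": 5, "SA": 6}

# Number of scored items per level: eight at age 12, six elsewhere.
ITEMS_PER_LEVEL = {level: (8 if level == 12 else 6) for level in CREDIT_MONTHS}


def mental_age_months(basal_year, passed_above):
    """Return mental age in months.

    basal_year   -- highest year level at which all items were passed
    passed_above -- mapping of level -> number of items passed at levels
                    above the basal year
    """
    total = basal_year * 12
    for level, n_passed in passed_above.items():
        if not 0 <= n_passed <= ITEMS_PER_LEVEL[level]:
            raise ValueError(f"impossible item count at level {level!r}")
        total += n_passed * CREDIT_MONTHS[level]
    return total


def as_years_months(months):
    """Split a month count into (years, months)."""
    return divmod(months, 12)


# Hypothetical subject: basal year 7, two items passed at the 8-year level.
example = as_years_months(mental_age_months(7, {8: 2}))

# Ceiling of the scale: basal at 14, every adult item passed.
ceiling = mental_age_months(14, {"AA": 6, "SA": 6}) / 12
```

Under these assumptions the hypothetical subject earns a mental age of 7 years 4 months, and the ceiling works out to the 19.5 years mentioned above.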


re the subject passes all 
The examiner then proceeds 
level is reached where the subject fails all 
“terminal year.” As already stated, each 


items. This is called the “basal year." 


two are passed at the 8-year 
; all are failed at the 9-year 
nature of which have already been explained. Terman not only found 
the IQ for each individual examined, but he analyzed the distribution of 
intelligence quotients obtained by the persons on whom the scale was 
standardized. 

Taking those subjects between the ages of 5 and 14 years, the distribu- 
tion was found to be as shown in Table 9.3. The SD of this distribution 
is about 12 points. (Compare this with the SD of 16 points of the 1937 
revision.) 


TABLE 9.3 

DISTRIBUTION OF IQs OF 905 UNSELECTED CHILDREN, 
AGES 5-14 YEARS 

             Percent 
IQ           of total 
 56- 65       0.83 
 66- 75       2.8 
 76- 85       8.6 
 86- 95      20.1 
 96-105      33.9 
106-115      23.1 
116-125       9.0 
126-135       2.3 
136-145       0.55 

Source: L. M. Terman et al. (6, p. 133). By permission. 

This distribution, being a fairly symmetrical one, showed that the 
scale did differentiate among the several levels of mental capacity of the 
persons examined; at least, so far as concerns the mental processes being 
tested. It therefore strengthened the belief of Terman and many others 
that the 1916 Stanford-Binet had considerable validity. 

Another method used to represent the frequency with which different 
degrees of intelligence occur was to indicate the percentage of subjects at 
and above, or at and below, a given IQ as in Table 9.4. Although the 
percentages above or below certain IQ levels are not fixed or identical 
for all tests (for example, the distribution for the 1937 Stanford-Binet is 
not identical with this one), a table such as this is significant in that it 
provides one means of determining an individual's relative status in 
respect to the psychological processes being measured; for, as already 
explained, the IQ is an index having educational and clinical connota- 
tions. 
TABLE 9.4 

PERCENTAGE DISTRIBUTION OF IQs: 
STANFORD-BINET SCALE, 1916 

The lowest   1%      go to   70 or below 
 "    "      2%      "   "   73  "   " 
 "    "      3%      "   "   76  "   " 
 "    "      5%      "   "   78  "   " 
 "    "     10%      "   "   85  "   " 
 "    "     15%      "   "   88  "   " 
 "    "     20%      "   "   91  "   " 
 "    "     25%      "   "   92  "   " 
 "    "     33 1/3%  "   "   95  "   " 
The highest  1%      reach  130 or above 
 "    "      2%      "      128  "   " 
 "    "      3%      "      125  "   " 
 "    "      5%      "      122  "   " 
 "    "     10%      "      116  "   " 
 "    "     15%      "      113  "   " 
 "    "     20%      "      110  "   " 
 "    "     25%      "      108  "   " 
 "    "     33 1/3%  "      106  "   " 

Source: L. M. Terman (5, p. 78). By permission. 


tools and conveniences. 

| y eir names, a classification is useful 

nt device for purposes of research and analysis of 
abel and Pigeonhole an individual 
who has been examined and for whom an intelligence quotient has been 
obtained. The trend is away from stating test results in terms of MA and 
IQ alone; the trend is toward the evaluation of individual performances 

Adult Mental Age and Adult Intelligence Quotient. Since the 
1916 Stanford-Binet includes tests at the levels of average adult and 
superior adult, it was necessary to make provisions for the calculation 
of adult mental ages and intelligence quotients. These, however, present 
special problems. 


TABLE 9.5 


SUGGESTED CLASSIFICATION OF IQs: 
STANFORD-BINET SCALE, 1916 




IQ Classification 
Above 140 “Near” genius or genius 
120-140 Very superior intelligence 
110-119 Superior intelligence 
90-109 Normal, or average, intelligence 
80-89 Dullness 
70-79 Borderline deficiency 
Below 70 Definite feeblemindedness 


Source: L. M. Terman (5, p. 79). By permission. 


We have already quoted Terman’s reason for locating average adult 
performance in the mental age range of 15 to 17, with the assumed mid- 
point at 16 years. If this is correct, it means that the test performance of 
the average adult is equal to that of the average 16-year-old individual. 
Otherwise stated, it means that in the case of an average adult, his maxi- 
mum level of measured intelligence is reached at the age of 16 and that 
there are no increments thereafter. Terman states that ". . . in so far as 
it can be measured by tests now available, [intelligence] appears to im- 
prove but little after the age of 15 or 16. . . . Although this point [at 
which intelligence attains its final development] is not exactly known, it 
will be sufficiently accurate for our purposes to assume its location at 
16 years” (5, p. 140). Thus, until the process of decline sets in, the average 
adult continues to have a mental age of 16, according to the 1916 Stan- 
ford-Binet. 

On the basis of this assumption, then, in the calculation of an IQ for 
a person who is 16 years of age, or older, the denominator in the formula 
(IQ = MA/CA) is always 16. Otherwise, if his actual CA were used, he 
would appear to be getting less and less intelligent with the succeeding 
years. For example, an average individual at the age of 16 will have an 
IQ of 100 (16/16). At the age of 18, he should still have an IQ of 100 
even though, according to the tests being used, there has been no further 
measurable development of mental capacity; for the formula will still 
be 16/16. Now if, in the case of this same individual, his actual CA was 
still used as the denominator, his IQ at age 18 would be shown as about 
89 (16/18); at age 20, it would be shown as 80 (16/20), and so on, while 
as a matter of fact there would ordinarily have been no such decline. 
Thus, by using the denominator of 16 in the IQ formula for all persons 
above age 16, if a person of 20 years and one of 60 years have the mental 
age of 16 years each, then each will be given an IQ of 100. 
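
The effect of fixing the denominator at 16 can be checked with a short Python sketch. The function name and the `cap_adult_ca` switch are hypothetical, introduced only for illustration, and the ratio is multiplied by 100 in the conventional way, which the shorthand MA/CA in the text leaves implicit.

```python
ADULT_CA_CEILING = 16.0  # years; the 1916 scale's assumed adult level


def iq_1916(mental_age, chronological_age, cap_adult_ca=True):
    """Ratio IQ, 100 * MA / CA, with ages given in years.

    With cap_adult_ca=True the denominator never exceeds 16, as described
    in the text; with False the actual CA is used, which produces the
    spurious decline discussed above.
    """
    if cap_adult_ca:
        divisor = min(chronological_age, ADULT_CA_CEILING)
    else:
        divisor = chronological_age
    return 100.0 * mental_age / divisor


# An average adult (MA 16) at several ages, with and without the cap.
capped = [iq_1916(16, ca) for ca in (16, 18, 20, 60)]
uncapped = [round(iq_1916(16, ca, cap_adult_ca=False)) for ca in (16, 18, 20)]
```

With the cap, the same mental age of 16 yields an IQ of 100 at every adult age; without it, the figures fall to about 89 at age 18 and 80 at age 20, as in the example above.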

The reader has also noted that it is possible to attain mental ages 
above 16 on the 1916 revision, the maximum being 19.5 years at the level 
of superior adult. If the definition of mental age is borne in mind, it will 
be apparent at once that a mental-age rating which is higher than that 
of the average adult has been given a new and specialized meaning. It 
cannot have the same meaning as the term "mental age" does ordinarily. 
A mental age is defined as the level of mental development of the average 
or typical group of persons at that same chronological age. Thus, an MA 
of 10 represents the test performance and mental level of a group of 
average children of chronological age 10. Hence, if we assume that 
average or typical adults reach a mental age of 16, then to speak of a 
"mental age" above 16 is to introduce a new concept; for these latter 
"mental ages" are not derived from the performance or norms of average 
or typical persons. They are theoretical and hypothetical indexes devised 
to enable us to indicate higher than average mental levels and higher than 
average intelligence quotients.5 Thus, when a higher-than-average adult 
"mental age" is used, it is essential that the user be aware of the fact that 
a new and different concept is being employed. 

The fact that the highest possible "mental age" that can be attained 
on this test is 19.5 years means that the highest IQ an adult can attain is 
about 122 (19.5/16). This maximum reveals a serious inadequacy of the 
1916 revision at the higher levels. What, for example, happens to the IQ 
of the 10-year-old child who has a mental age of 15, and an IQ of 150 
(15/10)? Obviously, to maintain that IQ of 150 at age 16 or older, he 
must be able to attain a mental age of 24 (24/16); yet the scale permits a 
maximum MA of only 19.5, with an IQ of 122. The same would be true 
for this subject after age 16. 
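
The ceiling can be verified with a few lines of arithmetic. The sketch below uses hypothetical names and takes only two figures from the text: the adult divisor of 16 and the maximum mental age of 19.5.

```python
ADULT_DIVISOR = 16.0   # years, from the text
MAX_MENTAL_AGE = 19.5  # years, the highest MA the 1916 scale can yield


def required_adult_ma(target_iq):
    """Mental age (years) an adult needs in order to show a given IQ."""
    return target_iq * ADULT_DIVISOR / 100.0


def max_adult_iq():
    """Highest IQ the scale can assign to anyone aged 16 or older."""
    return 100.0 * MAX_MENTAL_AGE / ADULT_DIVISOR


needed = required_adult_ma(150)            # MA needed to keep an IQ of 150
shortfall = needed - MAX_MENTAL_AGE        # years beyond the scale's ceiling
ceiling_iq = round(max_adult_iq())
```

The child with an IQ of 150 would need a mental age of 24 as an adult, 4.5 years beyond what the scale can register, so his recorded IQ could not exceed about 122.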


Criticisms of the 1916 Stanford-Binet. Experience with the 
1916 Stanford-Binet demonstrated that it was inadequate as a measure of 
adult mental capacity. In fact, experience showed that its usefulness was 
restricted to ages between 5 and 14 years, the range between 5 and 10 
years being the most satisfactory. 

This revision was also criticized on several other grounds. First, since 
the scale was finally standardized on the basis of results obtained with 
approximately 1000 native-born white children in California, its use 
with all groups of children in all parts of the United States seemed to 
many educators and psychologists to be a practice of doubtful validity; 
for it was held that the 1000 California subjects were not necessarily 
representative of the child population of this country. There is merit in 
the criticism, yet it must be recognized that this scale proved to be very 
useful in many parts of the United States, when employed and inter- 
preted by examiners who were familiar with its assumptions and con- 
struction, and who, at the same time, were familiar with the backgrounds 
of the subjects they were examining. 

Second, the scale was criticized as being much too heavily weighted 
with verbal and abstract materials, thus penalizing the individual who, 
for whatever reason, had been handicapped in developing his “verbal 
intelligence” through the medium of the English language. Terman's 
reply to this criticism was that intelligence at the verbal and abstract 
levels is the highest form, the sine qua non, of mental ability. Indeed, 
he defined intelligence as the ability to deal with abstract terms and to do 
conceptual thinking. 

This criticism of the scale was warranted, nevertheless; for children 
who are handicapped by lack of opportunity to acquire and develop the 
use of English are at a serious disadvantage and get spuriously low ratings 
on psychological tests that emphasize verbal intelligence. Such children 
would include (1) those who have developed in homes where only a 
foreign language is spoken; (2) those who are handicapped by serious 
visual or auditory defects; (3) those handicapped by sensory anomalies 
(reversals, inversions, mirror-writing, poor sound discrimination) that 
seriously interfere with their learning to read; (4) those who are too 
young, that is, below age 4 or 5, to be tested adequately by means 
of verbal materials exclusively. 

Third, the scale was found to be defective at some points with 
respect to procedures in administering and scoring, thus detracting from 
its objectivity and from the comparability of results obtained by different 
examiners. 

In view of these criticisms, it was to be expected that other scales 
would be developed, particularly those of the performance and nonverbal 
types, which would obviate or minimize the second criticism. These are 
presented and discussed in a later chapter. 

It was to be expected, also, that the Stanford-Binet itself should 
undergo revision in the light of experience, criticism, and accumulated 
data. Such a revision, begun about ten years after the original Stanford- 
Binet appeared, was published in 1937. 

THE STANFORD-BINET SCALES: 
1937 AND 1960 REVISIONS 


The 1960 revision of the Stanford-Binet Intelligence Scale is 
based upon the materials and standardization of the 1937 revision. In the 
former, the more satisfactory items of Forms L and M of the 1937 version 
have been retained and combined into a single scale; some individual 
items have been more satisfactorily located as to age levels; the deviation 
IQ has been introduced; and several additional improvements have been 
made. To understand and adequately evaluate the 1960 revision, it is 
necessary, therefore, to be well-informed on the 1937 scale. Furthermore, 
since there is a wealth of experimental and empirical information on the 
1937 edition which is significant in the interpretation of scores and re- 
sponses to both revisions, and since many school and clinical psychologists 
have developed a high degree of sensitivity in its use, it is probable that 
the 1937 version will continue in use, for a time, in some quarters. 


The 1937 Scale 


Description. This scale differs from that of 1916 in many de- 
tails, but it does not differ in its essential and basic conceptions. As the 
authors themselves state, “The revision utilizes the assumptions, methods, 
and principles of the age scale as conceived by Binet." They do, however, 
regard it as a better standardized and more useful scale than its prede- 
cessors. The principal differences and modifications follow. 

The 1937 scale has two equivalent forms (L and M), each of which 
contains 129 test items, as compared with the 90 items in the first Stan- 
ford-Binet. Items that proved unsatisfactory in the original were elimi- 
nated, and new ones were added. 

The 1937 scale extends downward to the level of age 2, and upward 
through three levels of “superior adult” (known as Superior Adult I, 
II, and III), thus increasing its usefulness. 

The levels below age 5 and those above age 14 have been more care- 
fully and validly standardized. 


Scoring standards and instructions for administering the tests are im- 
proved. 

From the age of 2 to age 5, this scale provides groups of test items at 
half-year intervals. Thus, more accurate and more highly differentiating 
test results are obtainable. The half-yearly intervals are possible because 
the rate of mental growth is most rapid in the earlier years and, there- 
fore, the more rapid periodic increments are susceptible to testing. 

Groups of tests are provided at ages 11 and 13; there were none at these 
levels in the 1916 scale for reasons already stated in Chapter 9. 

Although the 1937 scale is predominantly verbal in character, it does 
provide more performance and other nonverbal materials at the earlier 
age levels, especially through age 4. The performance materials are those 
with which the subject has to do something; for example, build a pattern 
or make a design with blocks, or fill in a formboard with the variously 
shaped blocks. Other nonverbal materials include such activities as copy- 
ing a geometric figure, completing the picture of a man, and discrimi- 
nating between forms. In all of these, verbal ability is a factor to the 
extent that verbal directions must be understood. In these tests, verbal 
ability can also operate if the subject is familiar with the names of the 
objects or geometric figures and is thus facilitated in his manipulation or 
classification of them. 

The 1937 scale was standardized on a carefully chosen and extensive 
group of subjects. The base of the standardization population was broad 
. . . the country, and an effort was made to have them from 
homes which, occupationally, . . . population at large. 
. . . year level; from . . . 100 at each year, . . . 

Validation. The test items were chosen on the basis of their 
validity, ease and objectivity of scoring, economy of time in administer- 
ing, interest to the subjects, and need for variation in types of materials. 

Of the foregoing, validity is of primary significance. In this revision, 
a criterion of basic importance in judging validity of test items was the 
increase in percentage of successful performance with increasing age. 
This criterion was applied in two ways: first, by requiring an appreciable 
increase in the percent passing a given item in successive ages (as in the 
1916 scale); and second, by finding “a weight based on the ratio of the 
difference to the standard error of the difference between the mean age 
(or mental age) of subjects passing the test and of subjects failing it" 
(32, p. 9). Stripped of its statistical terminology, this quotation means 
that the difference between the average age (chronological or mental) 
of subjects passing an item, on the one hand, and the average age of sub- 
jects failing that item, on the other hand, must be statistically significant. 
This is essentially an “age criterion.” In this connection see Table 10.1. 
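
The second form of the age criterion can be illustrated with a small Python sketch. It assumes the usual standard error of the difference between two independent means, the square root of the sum of each group's variance divided by its size; the passage does not say which formula the authors used, and the function name and the ages below are invented for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def critical_ratio(ages_passing, ages_failing):
    """Difference between mean ages divided by its standard error.

    ASSUMPTION: independent groups, SE = sqrt(s1^2/n1 + s2^2/n2), with
    sample variances. The source does not specify the exact formula.
    """
    difference = mean(ages_passing) - mean(ages_failing)
    standard_error = sqrt(variance(ages_passing) / len(ages_passing)
                          + variance(ages_failing) / len(ages_failing))
    return difference / standard_error


# Hypothetical ages (in years) of children passing and failing one item.
passing = [7.0, 7.5, 8.0, 6.5, 7.0]
failing = [5.0, 5.5, 6.0, 5.5, 6.0]
ratio = critical_ratio(passing, failing)
```

For these invented figures the ratio is a little above 5, so the children passing the item are reliably older than those failing it; an item for which the ratio stayed near zero would not discriminate by age and would carry little weight.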


TABLE 10.1 

PERCENT PASSING TEST ITEMS LOCATED IN YEAR VI 
(Form L) 

                  Ages 

Item   4   4½   5   5½   6   7   8 
l ae 15 36 50 67 (0 £80 ey 
2 1 29 44 55 70 86 95 
3 11 26 46 53 69 86 96 
4 8 11 43 48 71 94 96 
5 16 29 47 51 73 94 95 
6 26 44 52 61 81 91 93 


Source: Q. McNemar (23, p. 92). By permission. 


A second criterion of major importance in the retention of an item 
was its correlation with the total scores of the individuals of the age 
level at which the test item is located. Table 10.2 presents the distribution 
of correlation coefficients (biserial) for both Forms L and M. 

The calculated median of the coefficients for Form L is approximately 
.69; the middle 50 percent of the coefficients fall between approximately 
.51 and .73. The range for the whole set of coefficients is from .28 (memory 
for designs, year 11), to .89 (abstract words, year 11; and vocabulary, 
year 14). 

On Form M, the calculated median coefficient is approximately .64; 
the middle 50 percent of the coefficients fall between approximately .51 
and .71. The range for this whole set of coefficients is from .27 (memory 
for stories, year 13) to .91 (abstract words, year 19). 


TABLE 10.2 

DISTRIBUTION OF CORRELATION COEFFICIENTS (BISERIAL) 
FOR EACH ITEM WITH TOTAL SCORES 

          Frequency,   Frequency, 
r         Form L       Form M 
.20-          1            2 
.30-          5            7 
.40-         21           21 
.50-         28           23 
.60-         32           42 
.70-         34           23 
.80-          8            9 
.90-                       2 

N           129          129 

Based upon data in McNemar (23, Tables 53 and 54). 


Of the 258 coefficients, 201, or very nearly 78 percent, are .50 or higher. 
This fact and the data presented in a later section in this chapter 
(Analysis of Functions Tested) provide strong evidence that the Stanford- 
Binet scale measures “general ability" by means of test items that have 
psychological processes in common to a high degree. 

After selection of the tests that were to be used, one other empirical 
. . . in locating each test item at an appropriate age 
. . . rearranged until it was found that they would 
. . . mental age that was identical 
. . . tests so as to match those of Form L at 
each age level with respect to 
cially within the first several years after it had been made available. For 
many years, the Stanford-Binet has been regarded as the standard crite- 
rion, among psychological tests, for this purpose. (Also since their appear- 
ance, the Wechsler scales have been widely used for predicting educability 
at the elementary and secondary levels.) 

In the basic subjects at the elementary school level, correlations of the 
following magnitudes have been found. 


Reading: A majority are .60 or higher. 
The modal interval is .60-.69. 

Arithmetic: A majority are .50 or higher. 
The modal interval is .50-.59. 

Spelling: A majority are .45 or higher. 
The modal interval is .45-.55. 


At the high school level, the Stanford-Binet correlates well with grades 
in academic subjects, especially with those that are largely verbal. The 
approximate medians of the correlation coefficients are as follows. 


Reading comprehension      .70 
Knowledge of literature    .60 
English usage              .60 
History                    .60 
Algebra                    .60 
Biology                    .55 
Geometry                   .50 
Spelling                   .45 
Reading rate               .45 


Reliability. Comparing IQs obtained with Forms L and M, 
Terman and Merrill report reliability coefficients from .90 to .98. The 
highest coefficients were found for IQs below 70 (r = .98), the lowest for 
IQs above 130 (r = .90), and intermediate coefficients for IQs close to 100 
(r = .92). Age levels above 6 years showed greater reliability (r = .93) than 
those below 6 years (r = .88), when coefficients were calculated for separate 
age groups. 

These coefficients are of the same general order as those found by other 
investigators in subsequent studies. The Stanford-Binet has thus proved to 
be a highly reliable test, since most of the coefficients reported for different 
age groups and different IQ levels are very close to .90. 

Many test-retest studies of Stanford-Binet reliability, after long inter- 
vals, have been made. These researches are generally in agreement that 
correlation coefficients decrease as the interval between two testings is 
lengthened. Correlations increase, however, as the children grow older, if 
the interval between two testings is held constant; for example, one year 
Fig. 10.1. Scatter chart of correlation between Form L and Form M 
IQs at chronological age 7 (axes: Form L IQ, Form M IQ). r = .91. 
From L. M. Terman and M. A. Merrill (32, p. 45). By permission. 



the same group in the first follow-up after only ten years. The correlation 
between the 1941 and the 1956 testings is .85" (6). 

Although these findings are cited under the heading of reliability, it 
is doubtful if test-retest results obtained after such long intervals actually 
are measures of an instrument's reliability, especially when one set of 
measures is obtained in early childhood (before age 5). For the findings 
are the results not only of the test's inherent reliability; they are affected 
as well by changes in the form, content, and organization of mental 
activity as age increases during the period of development; by the char- 
acteristics of individual mental growth curves; by the terminal age of 
mental development; and by motivation. (See also Chapter 18, Scales 
for Infants and Preschool Children.) Therefore, when a test's inherent 
reliability is to be estimated, the testings should take place within a 
relatively short space of time. 

Mental Age and Intelligence Quotient. The scoring method 
is the same as that used with the 1916 scale for determining mental age 
and intelligence quotient. There are, however, a few differences in the 
details. 

Whereas the maximum mental age attainable on the 1916 Stanford- 
Binet was 19 years and 6 months, the maximum on the 1937 revision is 
22 years and 10 months. It will be recalled that with the first Stanford- 
Binet scale, a maximum CA of 16 was used in the denominator to de- 
termine the IQ of an individual 16 years of age or older. In the 1937 
scale, the maximum CA in the denominator is 15. Thus, the highest pos- 
sible IQ attainable by a subject who is 15 years of age or older is 152, that 
is, (22-10/15) × 100. 

In the superior-adult levels, the test items were selected and their credits 
allotted (in terms of months) in such a manner as to make the IQ distribu- 
tion of “. . . the older subjects resemble closely those of the younger, as 
presumably should be the case on an ideal scale” (32, p. 30). In order to 
achieve this desired goal, it was necessary for the authors to make adjust- 
ments in the denominator of the IQ formula, beginning at the age of 13 
years and 2 months. The reason given for this adjustment is that it is 
extremely difficult, perhaps impossible, to escape the effects of selection 
of subjects at the upper ages in standardizing a scale. The selection gen- 
erally is such as to include the average range, the higher mental levels, 
and the moderately retarded, but not the lowest, since less intelligent 
individuals tend to leave school earlier than others. Hence, norms of 
test performance of the older groups, it is argued, tend to be higher than 
they should be for an unselected sampling. These higher norms, in turn, 
tend to reduce the intelligence quotients of subjects in the older groups. 
It is this effect that the authors of the scale sought to correct by means of 
adjusting the denominator. 


Although Terman and Merrill believe they minimized the effects of 
selection, these were not wholly eliminated. Therefore, after a “trial-and- 
success" process directed toward making IQ distributions of older age 
groups resemble closely those of younger groups, the procedure adopted 
was the cumulative dropping of one out of every three additional months 
of chronological age from age 13 to age 16, and all of it after 16. In sub- 
stance, this practice is equivalent to saying that average adult mental age, 
on the 1937 scale, is 15. Table 10.3 gives a few examples. 


TABLE 10.3 

CORRECTION OF CA DIVISOR: 1937 STANFORD-BINET SCALE 

                Corrected 
Actual CA       CA divisor 
13-0            13-0 
13-3            13-2 
14-0            13-8 
14-6            14-0 
15-0            14-4 
15-6            14-8 
16-0            15-0 

Source: Terman and Merrill (32, p. 31). 
By permission. 


When, therefore, this scale is used with subjects who are more than 13 
years old, it is necessary to refer to the full correction table provided in 
the manual. Or the examiner may use the tables of IQs provided in the 
manual, in which the necessary adjustments have already been made. 
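
For readers who wish to see how the rows of Table 10.3 arise, the rule can be sketched in Python. This is an illustration only, not a substitute for the manual's correction table: the function names are hypothetical, and the point within each three-month step at which a month is dropped is an assumption, chosen so that the first dropped month falls at 13 years 2 months, as stated earlier, and so that every row of Table 10.3 is reproduced.

```python
def years_months(years, months=0):
    """Convert an age written as years-months (e.g. 14-6) to months."""
    return years * 12 + months


def corrected_ca_months(ca_months):
    """CA divisor in months for the 1937 scale (illustrative sketch).

    Below 13-0 the actual CA is used. From 13-0 to 16-0 one of every
    three additional months is dropped; from 16-0 on the divisor is 15-0.
    ASSUMPTION: the dropped month falls on the 2nd, 5th, 8th, ... month
    past 13-0; the manual's table governs the intermediate values.
    """
    start, end = years_months(13), years_months(16)
    if ca_months <= start:
        return ca_months
    if ca_months >= end:
        return years_months(15)
    extra = ca_months - start
    return start + extra - (extra + 1) // 3


def iq_1937(ma_months, ca_months):
    """Ratio IQ using the corrected divisor."""
    return 100.0 * ma_months / corrected_ca_months(ca_months)


# Highest possible IQ for an adult: MA 22-10 over a divisor of 15-0.
top_iq = round(iq_1937(years_months(22, 10), years_months(30)))
```

Under these assumptions an actual CA of 14-6 gives a divisor of 14-0, and the maximum adult IQ comes out at about 152, in agreement with the figure given above.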

Distribution of IQs. The mean IQs, for the subjects used in 
the standardization, are slightly above 100. But this, the authors say, is 
owing to an "intentional adjustment . . . inadequate sampling . . . 
adjustment was . . . each age level from 2 to 18 years, 
as shown in Table 10.4. The same data are represented graphically in 
Figure 10.2. 
Fig. 10.2. Distribution of composite L-M IQs of standardization group 
(horizontal axis: IQ, in ten-point intervals from 35-44 to 165-174). 
From Terman and Merrill (32, p. 37). By permission. 


In the determination of the equality and comparability of IQs from age 
to age, not only must the means be very much the same (ideally, identi- 
cal), but the variations should be the same at all age levels. If the dif- 
ferences between the variations of the age groups are large, then the 
same numerical IQ will have different significance at different chrono- 
logical ages. 

Consider the following hypothetical instance. Suppose a given test of 
mental ability yields results: 


Chronological Age     Mean IQ     Standard Deviation 
       10               100               14 
       11               100               20 


Accordingly, a 10-year-old child having an IQ of 86 (that is, one standard 
deviation below the mean) would have a percentile rating of approxi- 
mately 16, which, it will be recalled, means that this child surpasses 
about 16 percent of his age group. Now, according to the foregoing data, 
a child of 11 years whose IQ is 80 (likewise one standard deviation below 
the mean of his group) would also have a percentile rating of about 16, 
in spite of the fact that his intelligence quotient is six points below that 
of the 10-year-old in question. While this difference of six points may 
make little practical difference in the clinical and educational treatment 
of these children, it is necessary to be familiar with the implications of 
differences in variations. 
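
The hypothetical instance can be worked through in Python. The sketch assumes that IQs within each age group are normally distributed, which the passage implies but does not state; the function names are hypothetical, and the means and standard deviations are the invented values of the example above.

```python
from statistics import NormalDist


def percentile_rank(iq, mean, sd):
    """Percent of the age group falling below a given IQ.

    ASSUMPTION: IQs in the age group follow a normal distribution.
    """
    return 100.0 * NormalDist(mean, sd).cdf(iq)


def equivalent_iq(iq, mean_from, sd_from, mean_to, sd_to):
    """IQ in a second age group with the same standing (same z-score)."""
    z = (iq - mean_from) / sd_from
    return mean_to + z * sd_to


# Hypothetical values from the text: SD 14 at age 10, SD 20 at age 11.
rank_age_10 = percentile_rank(86, mean=100, sd=14)
rank_age_11 = percentile_rank(80, mean=100, sd=20)
same_standing_at_11 = equivalent_iq(86, 100, 14, 100, 20)
```

Both children stand at about the 16th percentile of their own age groups, and an IQ of 86 at age 10 corresponds to an IQ of 80 at age 11 when relative standing is held constant.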
Another aspect of the problem is this: taking this hypothetical case of 
TABLE 10.4 

IQ MEANS ADJUSTED FOR 1930 CENSUS FREQUENCIES 
OF OCCUPATIONAL GROUPINGS 
COMPOSITE OF FORMS L AND M, STANFORD-BINET 

Age      N      Raw      Smoothed 
2        76    102.1 
2½       74    104.7     103.3 
3        81    103.2     104.1 
3½       77    104.3     102.2 
4        83     99.2     101.6 
4½       79    101.2     100.8 
5        90    101.9     100.4 
5½      110     98.2     100.0 
6       203    100.0      99.8 
7       202    101.2     100.8 
8       203    101.1     102.0 
9       204    103.6     102.7 
10      201    103.5     103.0 
11      204    101.9     102.2 
12      202    101.2     101.6 
13      204    101.8     101.0 
14      202    100.0     101.3 
15      107    102.0     101.3 
16      102    101.8     103.3 
17      109    103.2     103.8 
18      101    106.3 

Source: Terman and Merrill (32, p. 36). By permission. 


the 10-year-old boy, if he is to have at a
held at age 10, his IQ will h
TABLE 10.5

IQ VARIABILITY IN RELATION TO AGE
(STANFORD-BINET)

                       SD
CA        N      Form L    Form M
2        102      16.7      15.5
2½       102      20.6      20.7
3         99      19.0      18.7
3½       103      17.3      16.3
4        105      16.9      15.6
4½       101      16.2      15.3
5        109      14.2      14.1
5½       110      14.8      14.0
6        203      12.5      13.2
7        202      16.2      15.6
8        208      15.8      15.5
9        204      16.4      16.7
10       201      16.5      15.9
11       204      18.0      17.3
12       202      20.0      19.5
13       204      17.9      17.8
14       202      16.1      16.7
15       107      19.0      19.3
16       102      16.5      17.4
17       109      14.5      14.8
18       101      17.2      16.6

Source: Terman and Merrill (32, p. 40). By permission.


Table 10.5 shows significant fluctuation in standard deviations of the
several age groups, especially between the extremes: 12.5 (SD) at age 6;
20.6 (SD) at age 2½; and 20.0 (SD) at age 12. It will be noted, too, that
the standard deviations fluctuate around 16 and 17 as a median value. The
standard deviation of the composite IQs (Forms L and M) is 16.4 for the
entire standardization group of subjects.

In respect to the fluctuations in IQ variability, the authors state:
"Notwithstanding our strenuous efforts to correct for . . . errors of
sampling, complete success is hardly to be expected, and a considerable
degree of irregular fluctuation in the found magnitudes of IQ variability
from age to age could reasonably be attributed to these sources of
error. . . . Since inspection of the values reveals no marked relationship
between IQ variability and CA over the age range as a whole, we may accept
16 points as approximately the representative value of the standard
deviation of IQs for an unselected population" (32, pp. 39-40). As evidence
in support of this position, the authors of the scale present the graph
shown in Figure 10.3. These distribution curves of composite (L and M)
intelligence quotients indicate that their variability is approximately the
same at the three age-level groupings.2


[Graph: percent of cases (vertical axis) by IQ interval, from 35-44 to 165-174 (horizontal axis), with a separate curve for each age grouping, including ages 2-5½ and ages 6-12.]

Fig. 10.3. Distribution of composite L-M IQs at three age levels. From
Terman and Merrill (32, p. 41). By permission.


Proceeding on the basis of the foregoing reasoning, that the standard
deviation of IQs is 16 points and that IQ values are comparable at all age
levels, Terman and Merrill provide a table of intelligence-quotient
equivalents in terms of standard deviations, using 16 IQ points as equal to
one SD. Thus, an individual having an IQ of 116 is given a standard score of
+1.00; an IQ of 84 is given a standard score of -1.00, etc.3
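
The conversion just described is a simple linear one. The sketch below, in Python, assumes a mean of 100 and the uniform 16-point SD; the function name is illustrative only. The second pair of calls substitutes the Form L standard deviations reported in Table 10.5 for ages 6 and 12 (keeping the assumed mean of 100) to show how far a uniform SD can depart from the variability actually found at a given age.

```python
def standard_score(iq: float, mean: float = 100.0, sd: float = 16.0) -> float:
    """Express an IQ as a number of standard deviations from the mean."""
    return (iq - mean) / sd


# Uniform 16-point SD, as in the table of equivalents.
print(standard_score(116))  # 1.0
print(standard_score(84))   # -1.0

# Same IQ of 116, but using age-specific Form L SDs from Table 10.5
# (mean of 100 assumed for illustration).
print(round(standard_score(116, sd=12.5), 2))  # age 6
print(round(standard_score(116, sd=20.0), 2))  # age 12
```

Under the uniform SD the IQ of 116 is always one standard deviation above the mean; under the age-specific values the same IQ is well over one SD above the mean at age 6 and less than one SD above it at age 12.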

The difficulty presented by unequal sigmas at the different age levels
has been overcome in a more satisfactory manner in the 1960 revision by
using the deviation intelligence quotient.

Suggested Classification of Revised Stanford-Binet IQs. The
classification in Table 10.6 has been provided by one of the authors of

2 Some critics have not accepted the argument and conclusions of Terman and Merrill.
It is our purpose, at this stage, only to present the scale and its rationale.

3 This practice, apparently viewed with favor, since the table of equivalents was
presented, is of questionable validity. It disregards the possibility that individual
ranks might be different if standardization of test items were more satisfactory at
those age levels where the given SDs were appreciably above or below 16.
the 1937 revision (33, p. 18). It will be noted that the nomenclature and
the percents in each of the several categories differ somewhat from those
of the 1916 instrument.

Like all such tables, its purpose is primarily descriptive and, also, to 
serve as an aid in the ordering and analysis of testing results. The table is 
valuable, as well, in showing an approximate distribution of intelligence 
quotients throughout most of the range of mental ability. 


TABLE 10.6

DISTRIBUTION AND CLASSIFICATION OF COMPOSITE L-M IQs
OF THE STANDARDIZATION GROUP

IQ           N      Percent    Classification
160-169        1      0.03     Very superior
150-159        6      0.2      Very superior
140-149       32      1.1      Very superior
130-139       89      3.1      Superior
120-129      239      8.2      Superior
110-119      524     18.1      High average
100-109      685     23.5      Normal or average
90-99        667     23.0      Normal or average
80-89        422     14.5      Low average
70-79        164      5.6      Borderline defective
60-69         57      2.0      Mentally defective
50-59         12      0.4      Mentally defective
40-49          6      0.2      Mentally defective
30-39          1      0.03     Mentally defective

Source: Terman and Merrill (32, p. 42). By permission.


Analysis of Functions Tested. The items of both Forms L and M have been
analyzed by factorial methods. McNemar, who made the first and major
analysis, concluded that at each of the several age levels the items test
("are saturated with") a common factor (g), and that this common factor is
the same one at all age levels (hence it may be called g). The weight of the
common factor differs somewhat among the various age levels; but the common
factor accounts, on the average, for about forty percent of the differences
(variance) in scores, and hence for about forty percent of the differences
in performance among a group of testees.

The statistical results also suggest the presence of group factors at the
following ages: 2, 2½, 6, 18, and possibly 7 and 11. These are second
factors (group factors) that account for from five to eleven percent of the
differences; a third factor (another group factor) contributes from four to
seven percent. The group factors do not appear to be the same at all age
levels, nor are they at all well defined with regard to the psychological
processes involved in them. Tentatively, however, McNemar suggests that
several of these group factors, at different levels, might be called "memory
for designs," "motor," "verbal." The most definite and significant
conclusion, however, is that one factor, g, is sufficient to account for the
intercorrelations of test items, with the few exceptions noted.

In another factor analysis, the statistics indicated that after the age of
4, the Stanford-Binet scale measures a general factor (16).4 This same
investigator ascribes test performance in the first two years to
"sensori-motor alertness." This does not necessarily apply to the
Stanford-Binet, the lowest level of which is year 2. Between 2 and 4 years,
he finds that a second factor, "persistence," is operative.

On the basis of an inspection of the test items themselves and of
psychological analyses of the items, it is hardly possible to accept
persistence as the principal factor accounting for test performance between
ages 2 and 4; nor sensori-motor alertness as a principal factor, if it
persists beyond the age of 2 years.

As a result of McNemar's factor analysis, all items "highly saturated"
with the general factor are included in the 1960 revision, although some
with lower loadings are also included. Furthermore, biserial correlations
of performance on individual test items with total-test score were such as
to confirm previous findings that the 1960 scale has high general-factor
validity.5
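
For readers unfamiliar with the statistic, the biserial correlation between a pass-fail item and total score can be sketched as follows. This is only an illustration in Python, using the standard textbook formula (difference between the mean total scores of those passing and those failing, divided by the SD of all totals and multiplied by pq/y, where y is the ordinate of the unit normal curve at the point dividing the proportions p and q). The data are invented for the example and are not Stanford-Binet results.

```python
from statistics import NormalDist, fmean, pstdev


def biserial(passed: list[bool], totals: list[float]) -> float:
    """Biserial correlation of a pass/fail item with total score.

    Assumes at least one pass and one fail, and that the pass/fail split
    reflects an underlying normally distributed ability.
    """
    pass_scores = [t for ok, t in zip(passed, totals) if ok]
    fail_scores = [t for ok, t in zip(passed, totals) if not ok]
    p = len(pass_scores) / len(totals)  # proportion passing the item
    q = 1 - p
    # Ordinate of the unit normal curve at the point cutting off proportion p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    mean_diff = fmean(pass_scores) - fmean(fail_scores)
    return mean_diff / pstdev(totals) * (p * q / y)


# Invented data: eight testees, one item, and their total scores.
passed = [False, False, True, False, True, True, True, True]
totals = [8, 10, 11, 12, 14, 15, 17, 19]

print(round(biserial(passed, totals), 2))
```

An item on which those who pass earn clearly higher total scores than those who fail yields a high positive coefficient, which is the sense in which such an item is said to be valid against the total score.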


When complex and comprehensive ("global") types of test items are used in
a scale, as in the Stanford-Binet, it is not surprising that attempts to
isolate relatively simple group factors through statistical analysis yield
results that are indefinite and at best tentative. The reason is that these
items, being complex, involve a number of interacting psychological
processes, organized in varying degrees. Group and specific factors will
emerge most clearly when the test items employed are fractionations or
segments of a whole pattern of mental functioning. But fractionation can
destroy "the whole" of the mental operations with which the examining
psychologist is often most concerned.


The identification of a general factor in a revision of the Binet scale 


. ‘For an analysis that is not i 
*The 


Do [1960] measures the same intellective functions at all parts 
fs d or "s intermediate and upper age levels than for the preschool 
nges have been made and our Population samples are less good” 
should not occasion any surprise, for it will be recalled that Binet himself
set out to develop an instrument that should test an individual's general
intelligence by means of sampling a variety of mental activities that are
manifestations of such intelligence. It appears, therefore, that
contemporary statistical analyses, applied to the age scale, are confirming
Binet's psychological insights.

It was found, also (as Spearman had shown in his earlier analyses), that
the various kinds of test items differed in the extent to which they tested
(are loaded with) the general factor. The following listing shows which
items were found to have high loadings of the general factor, and which had
low loadings (23, 33).


AGES 2 TO 4½

Low Loadings
Block building: tower
Block building: bridge
Three-hole formboard: rotated
Motor coordination
Copying a circle
Drawing a cross
Three commissions
Stringing beads

High Loadings
Picture vocabulary
Identifying objects by name
Response to pictures
Comparison: balls and sticks
Comprehension
Opposite analogies
Pictorial identification
Naming materials [used in making various objects]


AGES 5 TO 11

Low Loadings
Paper folding: triangle
Patience: fitting rectangles
Copying a bead chain
Reproducing a bead chain from memory
Picture absurdities
Word naming [free association]
Word naming: animals
Block counting

High Loadings
Pictorial likenesses and differences
Similarities: two things
Vocabulary
Verbal absurdities
Similarities and differences
Naming the days of the week
Dissected sentences
Abstract words [definitions]


AGES 12 TO SUPERIOR ADULT III

Low Loadings
Problems of fact
Reproducing a bead chain from memory
Memory for stories
Enclosed box problem
Papercutting [visual imagery]
Plan of search
Repeating digits [forward]
Repeating digits: reversed

High Loadings
Vocabulary
Verbal absurdities
Abstract words [definitions]
Differences between abstract words
Arithmetical reasoning
Proverbs
Essential differences
Sentence building
Examination of the items that have low loadings reveals that they sample
only a very limited range of functioning and that, with few exceptions,
they involve only the following processes: visualization (space perception
and spatial relationships), visual imagery, and rote memory (immediate
recall). All of these test items among the low loadings, falling under the
foregoing categories, are lacking, relatively, in complexity and would,
therefore, not have the differentiating power of the more complex tasks
required by the items having high loadings. The exceptions are word naming
(random and animals), and problems of fact. The latter is properly
considered to be a test of reasoning. Its low loading with g is therefore
surprising, but no explanation is apparent or available. Random word naming
tests richness of free association; word naming of animals tests controlled
association. A possible explanation of their low loadings might be that they
are fairly routine tasks that do not require the reasoning (organization,
analysis) demanded by the items having high loadings within the same range
of ages (5-11).
The foregoing listings are significant for at least two additional and
important reasons. First, a knowledge of test items that have high or low
loadings of the general factor enables the examining psychologist to make
a more thorough analytical and meaningful evaluation of an individual's
over-all test performance. The examiner is thus in a better position to
evaluate the strength of the general factor in a particular testee. This is
particularly valuable if the psychological nature of the general factor has
been determined. Second, inspection of the lists of items having high
loadings strongly indicates that the general, or common, factor is one that
involves acquisition of, use of, and reasoning with symbols (namely,
language and number), even though the testing of these begins at an
elementary level and at times utilizes nonverbal materials in presenting
the problem (for example, pictures, sticks). The mental activities required
by these test items have very much in common with Spearman's view that
intelligence is essentially the ability to educe relations and correlates.
Specifically, the following processes are involved in the test items having
high loadings: acquisition and use of vocabulary; verbal analysis of a
situation; verbal and numerical concept formation; insights into
similarities and differences (also involving concept formation); analysis
and synthesis of materials, both nonverbal and verbal; organization and
reorganization of materials, both nonverbal and verbal.
The list of items given above (based on statistical calculations) and the
indicated psychological functions that are involved provide an illustration
of how statistical and psychological analyses work together. They also
make it clear that superficial observations of differences between test items

6 It is probable that specific environmental advantages and disadvantages operate here.
7 See test item no. 5, year XI, 1960 revision.
can be misleading as to their essential psychological processes. For
example, the test items at early age levels requiring identification of
objects by name or use might be regarded simply as tests of information or
of specific rote learning, whereas they actually have much in common with
items that test "comprehension," and which are more obviously tests of
reasoning. Or, at a somewhat later age level, ability to define certain
words (vocabulary test) might be regarded simply as the result of specific
learning and verbal facility, whereas actually it has much in common with
perception of pictorial (as well as verbal) similarities and differences.
It is useful to classify test items as "information," "word knowledge,"
"perception of forms," "reasoning," etc.; but the point is that such
classification does not necessarily signify that each of the subtest
classifications measures a distinct group factor or a special factor.8
Types of Items. Bearing this important distinction in mind, 
then, we may indicate the types of items included in the Stanford-Binet 
Scale.9 


Test Items                                   Functions Involved

Years 2-5

Form perception and manipulation             Visual perception and analysis
  (blocks, formboards, stringing
  wooden beads)
Perception of differences in size
  and form

Visual-motor operations                      Visual analysis plus motor
                                               development

Perception of relationships (in pictures)    Visual perception plus beginnings
                                               of concept formation

Rote memory (using digits and sentences)     Immediate recall

Use of words in combination                  Language development and
Identifying objects by name or use             comprehension
Following directions


8 McNemar's findings are in close agreement with those of an independent study
made in Great Britain. Cyril Burt reports that a common factor accounts for 42 percent
of Stanford-Binet test variance, and that two subsidiary factors account, respectively,
for 12 percent and 16 percent of test-score differences. The close correspondence
obtained in the United States and in England gives additional weight to and confidence
in the Stanford-Binet scale as an instrument for measuring intelligence, particularly of
children and adolescents. See (10).

9 Since the Wechsler scale utilizes many of the same types of items as the Stanford-Binet,
additional factors affecting test performance on these items will be presented in
connection with the Wechsler scale in the next chapter.
Test Items                                   Functions Involved

Years 2-5 (continued)

Verbal comprehension and word knowledge      Reasoning with abstractions
Understanding of "opposites"                   and concept formation

Form perception                              Visual analysis

Visual-motor operations                      Visual analysis plus motor
                                               development

Years 6-12

Rote memory (using digits and sentences)     Immediate recall

Word knowledge (concrete and abstract)       Language development and
                                               concept formation

Verbal comprehension                         Reasoning with abstractions
                                               and concept formation

Number concepts                              Number concept formation and
Arithmetical reasoning                         reasoning with abstractions

Year 13-Superior Adult III

Visual analysis and imagery                  Visual perception and analysis
Perception of visual relationships             plus reasoning with nonverbal
Visual-motor operations                        materials

Rote memory (using digits, words,            Immediate recall
  and sentences)

Word knowledge                               Language development and
                                               concept formation

Synthesis of verbal materials                Reasoning with abstractions
Problem solving, using verbal materials
Verbal analysis

Arithmetical problems                        Concept formation plus
Analysis and comprehension of symbols          reasoning with abstractions



The Short Scale. It is possible to administer an abbreviated 
form of the scale, the constituent test items having been specified by the 
authors. A short scale, presumably, is used when the examiner does not 
need as accurate an index of measurement as it is possible to obtain and 
when the necessary time is not available for a full-length examination. 
The use of an abbreviated form, however, should be discouraged, for 
when it is at all desirable to administer a test of mental ability, it would 
be unwise not to require the greatest possible accuracy. 


The 1960 Revision 


Merrill summarizes the changes made and the differences between the 1937
revision and that of 1960 as follows (33, pp. 39-40):


The Stanford Revision in 1960 retains the main characteristics of scales
of the Binet type. It is an age scale making use of age standards of
performance. It undertakes to measure intelligence regarded as general
mental adaptability. The 1960 scale incorporates in a single form,
designated as the L-M Form, the best subtests from the 1937 scales. The
selection of subtests to be included in the 1960 scale was based on records
of tests administered during the five-year period from 1950 to 1954. The
main assessment group for evaluating the subtests consisted of 4498
subjects aged 2½ to 18 years. Changes in difficulty of subtests were
determined by comparing the percents passing the individual tests in the
1950's with the percents passing in the 1930's constituting the original
standardization group. Criteria for selection of test items were: (1)
increase in percent passing with age (or mental age); and (2) validity
determined by biserial correlation of item with total score. Changes
consisted in the elimination or relocation of tests which have been found
to have changed significantly in difficulty since the original
standardization; the elimination or substitution of tests which are no
longer suitable by reason of cultural changes; further clarification
of . . . scoring principles and test administration; and the correction
of . . . inadequacies of the 1937 scale, first by introducing adjustments
to make the average mental age that the scale gives more nearly correspond
to the . . . chronological age at each age level and second, by providing
revised and extended IQ tables that incorporate built-in adjustments
for . . . variability of IQs at certain age levels so that the standard
score IQs provided are comparable at all age levels.

From the foregoing quotation it is clear that although the 1960 re- 
vision is an improved version of the 1937 scales (L and M), one important 
innovation has been introduced. This is the deviation intelligence quo- 
tient, the necessity and meaning of which have already been explained. A 
second innovation, not mentioned in the quotation, is this: IQ tables have 
been extended to include chronological ages 17 and 18 because retest find- 
TABLE 10.7

THE ASSESSMENT GROUP TABULATED BY AREAS

                   Form L     Form M     L-M          Pretesting     Total
                   item       item       stratified   modified or    number of
                   analysis   analysis   samples      substitute     subjects
                                                      tests

New Jersey           892                                                892
Minnesota            850        208                                    1058
Iowa                 102                                               (636)
New York and
  California                                          96 + 588          684
Massachusetts         91                                                 91
California          1258        897        200                         2355
Totals              3193       1105        200          684            5716

Main sample


Parable populations, but the 
to make comparisons mean- 


small numbers at the higher MA categories ERSUK BUSINS gend t 
Source; Terman and Merrill ( 


ings indicated that mental development, as measured by the Stanford-Binet,
continues at least that long.

Basically, as Terman and Merril i 
1960 scale is 
those smoothed for the 1937 scale. It is clear that where, at a given age
level, the 1937 SD is greater than 16, K will be less than one, and the
deviation of an individual's conventional IQ from the mean IQ for that age
will be decreased when multiplied by K. Conversely, when the 1937 SD is
less than 16, that deviation will be increased. Thus, in terms of
deviations from their respective means, the DIQs will be comparable at all
age levels.


[Graph: SDs for Stanford-Binet Forms L and M combined (raw and smoothed) and SDs for deviation IQ, plotted against age in years, 2 to 18.]

Fig. 10.4. From Testing Today, No. 2. Boston, Mass.:
Houghton Mifflin Co. By permission.


Consider, for example, the two following instances, using conventional IQs.

CA: 12                    CA: 6
MA: 15                    MA: 7 years, 6 months
IQ: 125                   IQ: 125


The 1937 mean IQ (L and M averaged and rounded) is 102 for age 12 and
100 for age 6. The correction (K) for age 12 is .88 (since its smoothed 1937
SD is 18.2), while the correction for age 6 is 1.11 (since its smoothed 1937
SD is 14.4).11 When these values are substituted in the formula, we get a
DIQ of 120 (rounded) for the 12-year-old and 128 (rounded) for the
6-year-old. In each instance the derived DIQ has the same relation to the
SD of 16 as its corresponding conventional IQ had to its original SD (18.2 or


smoothed values was superimposed upon the graph of the raw standard deviations
[Forms L and M combined]. . . ." (see Fig. 10.4). Testing Today, December 1959, no. 2,
p. 5. Boston: Houghton Mifflin Co. In a personal communication, Pinneau states: "The
standard deviations of Forms L and M for an age level were squared, summed, and
divided by two and the square root taken to give the intermediate values which were
desired, so they would be equally applicable to Forms L and M."

11 For the table of the mean conventional IQs and "constants" (K), see Terman and
Merrill (33, Appendix A).
14.4); but now both deviation IQs are based on a common standard score
and may be compared directly. By contrast, although they had the same
numerical values, the original and conventional IQ of 125 for the
6-year-old was higher in the distribution of his age group (since the
smoothed SD equals 14.4) than was the 125 in the distribution of
12-year-olds (since its smoothed SD equals 18.2).
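
The arithmetic of the two instances can be retraced in a short Python sketch. The formula itself does not survive in the passage above, so the sketch assumes the form DIQ = (IQ - mean IQ) x K + 100, with K = 16 divided by the smoothed 1937 SD; this form reproduces the values of K (.88 and 1.11) and the DIQs (120 and 128) reported in the text. Function names are illustrative. The last function follows Pinneau's description of how the Form L and Form M standard deviations were combined; the subsequent smoothing step is not reproduced here.

```python
from math import sqrt


def ratio_iq(ma_months: int, ca_months: int) -> int:
    """Conventional IQ: 100 x mental age / chronological age."""
    return round(100 * ma_months / ca_months)


def correction_k(smoothed_sd: float, target_sd: float = 16.0) -> float:
    """Correction K, rounded to two places as quoted in the text."""
    return round(target_sd / smoothed_sd, 2)


def deviation_iq(iq: float, mean_iq: float, k: float) -> int:
    """Assumed form of the conversion: DIQ = (IQ - mean IQ) x K + 100."""
    return round((iq - mean_iq) * k + 100)


def combined_sd(sd_form_l: float, sd_form_m: float) -> float:
    """Square the two SDs, sum, divide by two, and take the square root."""
    return sqrt((sd_form_l ** 2 + sd_form_m ** 2) / 2)


iq_12 = ratio_iq(ma_months=15 * 12, ca_months=12 * 12)    # CA 12, MA 15
iq_6 = ratio_iq(ma_months=7 * 12 + 6, ca_months=6 * 12)   # CA 6, MA 7-6
print(iq_12, iq_6)                                         # 125 125

print(deviation_iq(iq_12, mean_iq=102, k=correction_k(18.2)))  # 120
print(deviation_iq(iq_6, mean_iq=100, k=correction_k(14.4)))   # 128

# Raw (unsmoothed) combination of the age-12 SDs listed in Table 10.5.
print(round(combined_sd(20.0, 19.5), 2))
```

The same conventional IQ of 125 thus becomes a lower deviation IQ where the original variability was large (age 12) and a higher one where it was small (age 6).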
Essentially, the 1960 scale is the same as that of 1937; but it is an
improved version because of relocation of some items, selection of the most
discriminative items from Forms L and M, and the substitution of the
deviation IQ for the conventional one. It is to be noted, however, that in
no case will the former differ so markedly from the latter as to move an
individual's revised rating into a significantly different level and
category, either upward or downward. The reason for this is that the
correction factor (K), for changing conventional IQs into deviation IQs, is
no less than [...] and no greater than 1.12 at any age level, while almost
exactly fifty percent of the correction factors are between .94 and 1.06.


Evaluations and Criticisms 


The Stanford-Binet scales have been subjected to criticism and 


evaluation on theoretical, experimental, and practical grounds. On the 
whole, psychologists and educators who have had experience with these 
instruments are in essential agreement that they have proved to be highly 
ified persons. There are, however, a number of 


versions (1916 to 1937) and twenty- 
the third (1957- 


tain responses became a 


lete within relatively few Years, and the age-level place- 
21). Terman and Merrill report 
nt in these respects. One realizes 
ed improvements and corrections 
In the case of a point scale, on the other hand, it is a much simpler
process to revise age norms. All other things being equal, of course, the 
simpler and easier methods should be employed to achieve a desired goal 
in psychological testing. But simplicity and ease alone should not be the 
decisive considerations. The crucial question is whether the age scale or 
the point scale provides a superior means of obtaining a measure of an 
individual's mental ability. Thus far, although the views of competent 
professional persons are not unanimous, it appears that the age scale is 
preferable for use with children and young adolescents. But when older 
adolescents and adults are to be examined, it has often been found that 
point scales (for example, the Wechsler) are preferable (15). Also, the 
use of point scales having well-defined subtests has been increasing be- 
cause, in clinical cases, they facilitate "scatter analysis" of scores on the 
subtests (see Chapter 14). 

Does the Stanford-Binet consist of a variety of disconnected tests? This
scale has been criticized at times as being only that. This criticism,
however, is based upon a failure to take into account the theory of
intelligence, the method of measurement, and the basis upon which test
items were selected; namely, the sampling of general ability by means of
a representative variety of types of items, arranged by age level, in order
to obtain an adequate estimate of the mental processes involved. The
factorial analyses already discussed show that the test items measure
primarily a general factor common to all age levels of the scale.
Furthermore, the high biserial correlations found between individual items
and total scores demonstrate the presence of a common factor. On cursory
inspection, the items may seem to be unassociated; but psychological and
statistical analyses demonstrate that in actuality such is not the case.12

Is the composite, or "global," type of scale (such as the Stanford-Binet)
preferable to the factorially analyzed type, in which "factors" are
separately tested and scored? The final answer to this question will depend
upon whether the factorial type of scale proves to be more valid, more
accurate, and more useful than the composite type has been in clinical work
or in educational and vocational guidance. At present two scales of the
global type (Stanford-Binet and Wechsler scales) continue as the most
widely used instruments for the measurement of individual mental ability
(as contrasted with group testing), especially in school and clinical
studies of individual "problem cases." At present, also, some "multifactor"
group tests are being rather widely employed for purposes of educational
and vocational guidance. This type of device is discussed in Chapter 17.


12 Even those statistical analyses that emphasize group factors, rather than a general
factor, refute this criticism.
Is the Stanford-Binet scale too heavily weighted with verbal and abstract
materials? A criticism heard with some frequency against both the old and
the new revisions is that these scales place a premium upon "verbal
intelligence" and that subjects having language handicaps are therefore
incorrectly rated. In reply to this criticism, Terman and others hold that
the most essential and most significant aspect of higher thought processes
is the ability to do conceptual and abstract thinking; that is, to deal
with language, number, and other symbols. It is maintained, also, that the
vocabulary test, when used with children from homes where English is the
primary language, has higher value than any other part of the scale.

It must be emphasized, however, that this is not so in the case of a child 
who, even though he comes from such a home, has reading or language 
difficulties due not to lack of capacity but to visual or auditory defects or 
anomalies. 

In actual clinical practice, the examiner should always supplement 
an essentially verbal test of intelligence with one of the nonverbal types 
if he has any reason to suspect that the former penalizes the subject. 
There will be occasions also when it will be desirable to obtain a rating 
for an individual on both types, even where no language handicap is indicated,
for the purpose of comparing two or more aspects of a subject's
ability. While the correlation between performance on verbal and on 
some nonverbal tests of mental ability is high, coefficients of correlation 
reflect group trends and relationships; and unless the correlation co- 
efficient is perfect (plus or minus 1.00), there are always individual ex- 
ceptions from the generalization that can be made on the basis of the 


coefficient; hence the need, at times, for the study of the several aspects of 
ability in an individual case. 
other scales provide ratings of intelligence, within limits of error, under
existing school conditions, general environmental conditions, and clinical 
conditions. Obviously, therefore, the examiner must know the general 
developmental background of the individual he is testing, if his inter- 
pretation of test results is to be sound. 

Several studies of satisfactory and unsatisfactory responses on the 
Stanford-Binet have found that bright and superior children answer 
correctly more of the "intellectual" items than do normal or dull children 
(2, 22). These items include the verbal and numerical, utilizing symbols 
and abstractions. This finding does not mean that bright and superior 
children have attained their ratings on mental tests only as a result of 
schooling. It is to be expected that individuals who are potentially above 
average in mental ability will be superior in dealing with situations and 
problems employing language and number; for the greater the capacity 
of the individual is for mental development, the greater will be his 
ability to deal with symbols and to handle situations and problems at the 
level of abstraction. 'The converse has also long been known; namely, that 
one of the principal deficiencies of the mentally retarded and mentally 
defective is their inability to deal with materials and concepts at the levels 
of abstraction. It will be recalled that one definition states that intel- 
ligence is the ability to deal with abstractions. It will be recalled, also, 
that the educing of relations and correlates extends upward to the use of 
symbols (language and number).
Are some of the items in the Stanford-Binet scale obsolete? Since any 
test utilizes materials from the environment in which it is to be used, and 
since environments normally undergo change, it is to be expected that 
some items in any test will in time become culturally obsolete. In the 1937 
Stanford-Binet there were a few such items. Some examples follow: 


Identifying a toy steam locomotive by name
Identifying objects (pictures) by use, such as an old-fashioned kitchen stove
Response to a picture (Messenger Boy, year 12)


Terman and Merrill state: “In the 1930's . . . 69 percent of the three-
year-olds of the standardization group recognized and could name 5 out
of 6 items consisting of miniature object reproductions of shoe, watch, 
telephone, flag, jack-knife, and stove. In the 1950’s only 11 percent of 
children whose mental age on the scale was three years were able to do 
so” (33, p. 19). This change can be explained by the obsolescence of some 
of the objects. 

There are not many such items in this scale. In the case of some non- 
obsolete items, however, it becomes necessary, in time, to revise, to some 
extent, the responses that are acceptable for credit. The following item is 
an example. "What's the thing for you to do when you are on your way
to school and see that you are in danger of being late?" (Year 7). 

When scoring an unusual response to an item like this, the qualified ex- 
aminer is warranted in exercising his judgment as to its correctness; in 
fact, he has to do that. And his decision will be based upon his familiarity 
with the psychological processes being tested by the particular item.
Does the Stanford-Binet scale test different abilities at different age 
levels? The answer to this question is to be found largely in the preceding 
discussion of “Analysis of Functions Tested” in this chapter. There it was 


levels, and in addition there appear to be group factors at 2, 2½, 6, 18,
It was pointed out that the items having low 
with minor exceptions, tests that 
visual imagery, and rote memory. 
necessity of distinguishing between 
vocabulary, arithmetical reasoning, etc.) 
and basic psychological processes involved in each. 

selection and retention of test items for 
correlation of each item with the total 
the retained items shows that there are 
; dings of the general factor. Thus, the 
€ general factor throughout the scale to a 
5 that of 1957. 

nality and creative abilities? The answer 
hese abilities, as such, to an important de- 
cluded in the prescribed objective scoring, a qualified examiner will note
responses indicating these traits and will include them in his interpreta- 
tion and evaluation of the examinee's performance. 

Is the Stanford-Binet scale adequate at the adult level? The stand- 
ardization group of the 1937 scale included individuals of 18 years, but it 
did not include an adult population. Therefore, the test items at the sev- 
eral adult levels rest upon theoretical considerations already mentioned, 
rather than upon actual samplings of adult performance. One result, due 
perhaps to the methods used in standardizing the scale at superior adult 
levels, has been its frequently observed inadequacy with college students 
who, as a group, would be ranked above average. The inadequacy of the 
scale is especially marked when administered to very superior students, 
for it is not difficult enough at the higher levels of adulthood. 

In the 1960 revision, the retained items are an improvement throughout 
the scale. IQ tables have been adjusted to include ages 17 and 18, be- 
cause retest findings in more recent years indicate that mental growth con- 
tinues to age 18. The ratings obtained to age 18, therefore, will be more 
valid than previously. Consequently, too, the ratings above age 18 should 
be more valid for the large majority of individuals, with the probable 
exception of those at the highest levels of mental ability.

Some psychologists have questioned whether certain items and materials 
included at adult levels are of sufficient interest to persons of their age. 
The 1960 revision is an improvement in this respect, but it should be 
clear that it is not possible to provide items which will appeal to the 
special interests of many adults. In fact, it is undesirable to give any group
such an advantage. On the other hand, the test items should be at a
sufficiently advanced level of maturity, such as will evoke a favorable and
cooperative attitude on the part of the examinee.

There is also the question of the soundness of using the mental-age 
index for individuals who score above the mental level of the average 
adult. The reason for this question has already been presented. Many 
psychologists, therefore, prefer to discard the mental age at adult levels 
in favor of percentile ranks, decile ranks, and standard scores. 
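
As a rough illustration of the alternative indexes just named, the following sketch converts a raw score to a standard (z) score and to a percentile rank. It assumes that scores in the reference group are normally distributed with a known mean and standard deviation, which the text does not state; the function names and the sample values (mean 100, SD 15) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def standard_score(raw: float, mean: float, sd: float) -> float:
    """Express a raw score as its distance from the group mean in SD units."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (raw - mean) / sd


def percentile_rank(raw: float, mean: float, sd: float) -> float:
    """Percentage of an (assumed normal) reference group scoring below `raw`."""
    return 100.0 * NormalDist(mu=mean, sigma=sd).cdf(raw)


# Hypothetical adult reference group: mean 100, SD 15 (assumed values).
z = standard_score(115, mean=100, sd=15)
pr = percentile_rank(115, mean=100, sd=15)
print(f"standard score = {z:+.2f}, percentile rank = {pr:.1f}")
```

Neither index refers to a mental age, which is why both remain usable for individuals who score above the level of the average adult.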

Is the Stanford-Binet scale clinically useful? Judging by the extent to 
which it is employed, the answer must be a strong affirmative. Examiners 
find the scale useful not only for deriving a mental age and an intelligence 
quotient, but also as the “framework” within which a psychological in- 
terview may be held. If the scale is to serve this purpose, the examiner 
must have considerable clinical experience and skill to interpret and 
evaluate a subject’s responses and behavior. Nearly all psychologists agree 
that there is great value to a clinician and to a school psychologist in 
having available objective and standardized devices that permit
sufficient flexibility to meet the demands of a particular case.

As new scales are devised—built, perhaps, upon aguante i à 
and theories—they will have to be subjected to both proie sed 
vestigation and practical use before their value can be € pee 
that of the Stanford-Binet. A valid judgment cannot be — dits 
11.


THE WECHSLER SCALES 


The Binet scale and its several revisions are largely verbal in
content, although some nonverbal items are included, especially at the
early age levels. There are, however, other scales that are wholly, or in
large part, of the performance or nonverbal type. In performance tests,
the use of language is eliminated from test content and response, although
directions are generally given orally. In a few instances the directions,
too, are given without the use of language, by employing pantomime
instead.

The test materials of a nonverbal scale consist of concrete objects such
as formboards, cubes (to be arranged in specified ways), mazes, geometric
figures, pictures (cut up, to be correctly assembled), and others that will
be described in later sections. The individual's responses depend upon
manipulations, visual perceptions, and interpretations that are implied
by what he does rather than by anything he says.

Performance tests were first devised as a supplement to or substitute
for the Stanford-Binet scale in order to examine deaf, illiterate, or non-
English speaking subjects. Since their introduction, the use of nonverbal
tests has been extended; for they are now utilized with children who have
or are suspected of having reading difficulties, with those who have attended
school irregularly and thus might have been handicapped in developing
verbal ability, and with persons who might have been handicapped
by markedly inferior environmental conditions. Nonverbal tests
are used, also, by examiners who, for any other reason, believe that such
a scale will yield a more complete picture of the individual whose capacities
are being analyzed and evaluated. The Wechsler scales to be described
combine verbal and nonverbal materials within a single instrument
to obtain the advantages, comparisons, and contrasts provided by
both types of test items.


Form I (1939) 


Range. The first scale, published in 1939, was intended to
test the intelligence of persons from the age of 10 years through 60, although
norms were provided beginning at 7.5 years. This, or a similar
beginning level, was necessary if adults and adolescents of inferior mental
levels were to be tested by means of the scale.

Content. The 1939 version (known as the Wechsler-Bellevue
scale), and the 1955 revision (known as the Wechsler Adult Intelligence
Scale), which will be described later in detail, are in part verbal and in

FIG. 11.1. A 12-year-old girl is shown here taking the block design
test of the Bellevue scale. The model of the design to be copied is
on the card before her. The examiner is timing her with the stopwatch
in his left hand. (Acme Photo.)


Need for an Adult Scale. It will be recalled that Binet's own
scales were not suitable for use with adults; nor was the 1916 Stanford-
Binet. Although the Stanford-Binet revisions of 1937 and 1960 are better
standardized at the upper age levels, they are not intended primarily for
adults, whereas the Wechsler scale of 1939 was designed for adolescents
and adults, and the 1955 scale is for use with persons above the age of 17
years.

As explained in Chapter 10, it is difficult to use mental ages with above-
average adults. The Wechsler scale for adults avoids this difficulty, as will
be seen, by not using mental ages at all.

Intelligence testing of adults was begun on a large scale in 1917 with the
establishment of a psychological division in the United States Army, in
World War I. At that time, the Army Alpha (verbal) and Army Beta
(nonverbal) tests were assembled, and with them about 1,750,000 men
were tested. This experience in large-scale testing provided the impetus
for the development, after the war, of a number of other group tests for
adults. But these tests did not prove to be adequate, except for several
that were designed for use with selected and limited groups of our population,
such as candidates for admission to colleges. (These will be presented
in a later chapter.)

The Underlying Theory of Intelligence. The first step was to
adopt a theory of the nature of intelligence to serve as a framework within 
which the test items should fit. The general-factor theory (g) was ac- 
cepted, requiring that there should be significant intercorrelations 
among the several subtests of the scale, and that the scale as a 
whole should provide a valid index of an individual's general intelli- 
gence. 

Having accepted the theory of a general factor, the author of the scale 
then had to determine which types of test materials should be included 
in order to measure that factor most effectively. On the basis of past ex- 


d nonverbal materials, rather 


preceding Psychologists, and that experie 
proved to be valuable, In some instances, specific items already in use 


her instances, it was 


necessary to create new items for each of the subtests, 


nce the scale was to provide norms below 
adult levels, a sampling of school children 16 


eria were used to establish validity 
teachers’ ratings, 


tion with the Sta 
publications were heavily weighted on differential diagnosis and other
clinical aspects.5 

Reliability studies were surprisingly meager in number and those that 
were made were concerned almost exclusively with abnormal groups of 
subjects. Furthermore, reliability coefficients for each of the subtests were 
not high enough to warrant making differential diagnoses, which are 
based upon the patterns ("profiles" or variations) of subtest scores. On the
other hand, reliability coefficients for verbal IQs (six subtests), perform- 
ance IQs (five subtests), and full-scale IQs were quite satisfactory, though 
not as high as those found with the Stanford-Binet. 


The Adult Intelligence Scale (1955) 


Although it does not introduce any new principles in content,
construction, organization, scoring, or IQ deviation, this edition, known
as the Wechsler Adult Intelligence Scale (WAIS), has met some of the
adverse criticisms of the 1939 version. The principal changes are in its
improved content, extension of the standardization population sample,
and improved directions for administering and scoring.6

CONTENT. Range of difficulty has been extended, chiefly downward, in
order to assure a score for the lower level of mentally deficient subjects.
Upward extension in difficulty has been slight. Progression of difficulty
from item to item has been improved. Obsolete items have been replaced.
Items having poor “item validity” and those overlapping others
in content have been replaced, as have those that were ambiguous. Illustrations
in the “picture completion” subtest have been more clearly
drawn. The “vocabulary” subtest has been revised so as to produce a
fairly normal distribution of scores for a representative sample of the
population. Maximum scores in verbal and nonverbal subtests and in
the full scale are reached by the 25-29 year age group.

POPULATION SAMPLE. Norms are based upon a sample of 1700 persons,
850 of each sex, from four widely separated geographic areas. The subjects
ranged in age from 16 to 64 years. The age range was divided into
seven age groups, within each of which the numbers were proportioned
according to the 1950 United States census with respect to geographic
area, race (white and nonwhite), occupation, urban-rural, and years of


5 Differential diagnosis is defined as distinguishing between two or more conditions
(classifications, categories) by means of the patterns of symptoms (or indicators) that
characterize each condition.

6 The following details of revisions are provided not only to describe the new
scale itself, but also to enable the student to understand more clearly the processes of
test development and improvement.
formal education. Supplementary data were also obtained for a sample
of older persons (N = 852) above 65 years of age.

The Subtests. The instrument has six subtests that constitute
the verbal scale and five in the performance scale.

1. Information test. This test consists of items of information covering 
a wide range. (For example, “How many weeks are there in a year?") 
The assumptions are that the questions cover a wide enough range of 
materials to provide an adequate sampling of information acquired by 
a person who has had the usual opportunities of our society; that the 
range of an individual's information is an indication of his intellectual 
capacity; and that more intelligent persons have broader interests,
more curiosity, and seek more mental stimulation. This view can be valid,
however, only if the subjects being tested have had the usual opportu- 

4. Similarities test. This part of the scale consists of twelve sets of
paired words; the subject is required to state in what way the words in 
each pair are similar (for example, orange-banana). The author of the 
scale regards the similarities test as one of the most satisfactory, for it 
appears to sample very well the "general factor" (Spearman's g), or what 
is commonly called general intelligence. 

5. Memory Span for digits, forward and backward. The subject is re- 
quired to repeat series of digits heard once. The series vary in length 
from three to eight (backward) and nine (forward). This is a test of 
immediate memory span. Psychological studies, both experimental and
clinical, have consistently shown that tests of immediate recall of digits
have a low correlation with other, more valid tests of intelligence. Yet, 
memory span for digits continues to be used because it is helpful in 
detecting the mentally defective, whose span is often very short (gen- 
erally less than five digits forward and less than three backwards), and 
because very poor span is useful in making certain clinical diagnoses 
of organic defects. Poor memory span for digits, especially backwards, is 
also found at times in cases of persons who are unable to apply the 
attention necessary in solving more difficult mental tasks. 

6. Vocabulary test. This subtest consists of forty words arranged in the
order of increasing difficulty. We have already stated that most psychologists
concerned with intelligence testing believe that a vocabulary test
is one of the most valuable types of material used in deriving an index
of a person's general mental ability, if there have been no unusual developmental
or environmental handicaps. Thus, although the vocabulary
subtest was originally provided for use as an alternate or supplement,
experience and statistical study demonstrated its value, so that it is now
used as a regular part of the scale. Also, like Binet and many other
psychologists, users of Form I and the WAIS have observed that qualitative
differences in word definitions, as given by various subjects, have
clinical value and educational significance in helping to reveal the nature
of an individual's thought processes (depth, extent of analysis, nuances
of meanings, cultural background, bizarreness of definitions) and, in some
instances, feelings, emotions, and values.

7. Digit-symbol test. The subject is shown nine divided rectangles;
in the upper half of each rectangle is a digit; in the lower half there
is a symbol. The key is followed by seventy-five rectangles (of which ten
are practice samples) in which only the numerals are given. In each instance,
the subject is required to insert the appropriate symbol. This
test, also known as a substitution test, requires the association of symbols,
and involves speed and accuracy of performance. It also involves visual
memory. The purely motor factor, it has been found, is relatively unimportant,
except in the case of illiterate persons who are not accustomed
to using pencil and paper, and, of course, those who have suffered neuromuscular
or other anatomical damage. The following is an item from
the digit-symbol test.


8. Picture completion test. In this part, there are fifteen cards, each 
of which shows a picture that is incomplete in some detail (for example, 
a picture of a face with the nose missing). The testee is required to note
and name the missing part. In some pictures the task is quite simple;
but in others the deficiencies of the pictures are


ever, this test is ina 
whole, it is said that 


m nonessential details” (74, p. 78). 


- This subtest utilizes nine identical cubes, some 


Situation without the use of 
language 

11. Object assembly test. This subtest of the scale includes four “figure
formboards” that repr i 


€ach cut into several parts 
© the whole. The inclusion 


“mutilated Pictures.” 


It was used by Binet ia his 
P tests, as well as in 


Binet revisions, 
and their reconstruction into a meaningful whole. It has, in addition, clinical
value of a qualitative kind; for it contributes to the examiner's understanding
of the subject's modes of perception, the degree to which he
relies upon trial-and-success methods, and the manner in which he responds
to his errors.

FIG. 11.2. The disassembled hand (object assembly test). From the
Wechsler-Bellevue Scale. By permission.

Functions Involved in the Subtests. In each of the eleven subtests,
the functions involved may be psychologically analyzed as shown
below. These indicate the processes that are operative in the most effective
performance on each of the subtests. This analysis should be distinguished
from a factorial analysis. The latter is a statistical technique
employed to reduce the number of nonstatistically analyzed functions,
on the basis of communality.



Subtest Functions Influencing Factors 
-ra ntion " 
Informati Tongxanpe din P anon Cultural environment 
mati a nd organiz: 
tion Association a g mean 
of experience 

Reasoning with abstrac- Cultural opportunities 

Comprehension tions 8 Response to reality situa- 


Organization of knowledge tions 
Attention span 


Concept formation 
Opportunity to acquire 


Arithmetic Retention (of arithmetical ihetandamentallaride 
processes) metical processes 
Similariti Analysis of relationships A minimum of cultural 
US Te Verbal concept formation opportunities 
nt 45 
Vocabulary ae ae adm Cultural opportunities 
oncep 


Immediate recall 
Digit Span Auditory imagery 
Visual imagery at times 


Attention span 


e “Reasoning with abstractions” generally involves the processes of both analysis and
synthesis, with the use of symbols (language and number). The testee must first analyze
the relationships existing among the members, or parts, of the whole problem; then he
must reorganize and interpret and, at times, create new wholes in order to reach the
desired solution.
Subtest               Functions                            Influencing Factors

Picture Arrangement   Visual perception of relationships   A minimum of cultural
                        (visual insight)                     opportunity
                      Synthesis of nonverbal material      Visual acuity at times

Picture Completion    Visual perception: analysis          Environmental experience
                      Visual imagery                       Visual acuity at times

Object Assembly       Visual perception: synthesis         Rate of motor activity
                      Visual-motor integration             Precision of motor activity

Perception of form R f ivit 
: 3 . ; activi 
Block Design Visual perception: analysis i d nee » ty 
, z x i sion 
Visual-motor integration PE UNUOESCOLOTY! 


Digit Symbol          Immediate rote recall                Rate of motor activity
                      Visual-motor integration
                      Visual imagery


verbal comprehen- 


“space,” or “visual-motor organization”; a general, ne Moni ues: 
factor. The most significant of these in accountin for Hm y test 
scores is the general factor.10 Thus, in Spite of the dise ine a Of 
materials used in each of the subtests, th aoe HS 


€ same or similar mental opera- 


r, variously na 


* This factor appears to be the same as Spearman's eduction of relations and eduction
of correlates. See Chapter 7.

10 See page 161 for a discussion of the meaning of the general factor.
tions are involved to a considerable degree; that is, they show a high
degree of "communal variance." Wechsler reports that g accounts for
about 50 percent of the total variance contributed by all the tests, and from
66 to 75 percent of the communal variance shared by two or more tests
(74, pp. 121-122).

Reliability: SPLIT-HALF. The manual (73) reports split-half reliability
coefficients and standard errors of measurement based upon
results obtained with three age groups: 18-19 (200 subjects), 25-34 (300 
subjects), and 45-54 (300 subjects). There are only slight differences be- 
tween coefficients found for each group; therefore, only those for one age 
group are shown in Table 11.1. 


TABLE 11.1

RELIABILITY COEFFICIENTS AND STANDARD ERRORS OF MEASUREMENT:
18-19 YEAR AGE GROUP *

Subtest                  r      SEm
Information             .91    0.88
Comprehension           .79    1.36
Arithmetic              .79    1.38
Similarities            .87    1.11
Digit Span              .71    1.63
Vocabulary              .94    0.69
Digit Symbol            .92    0.85
Picture Completion      .82    1.18
Block Design            .86    1.16
Picture Arrangement     .66    1.71
Object Assembly         .65    1.65
Verbal IQ               .96    3.00
Performance IQ          .93    3.97
Full-Scale IQ           .97    2.60

* The standard error of measurement is given in units
of the scaled scores. The IQ reliabilities are given, of
course, in IQ units. Two of the subtests show more than
“slight differences” in coefficients for the three age
groups: arithmetic (.79, .81, .86); picture arrangement
(.66, .60, .74).

Source: Wechsler (73, p. 103). By permission.


The reliability coefficients for the three types of IQ are highly satisfactory.
Their standard errors of measurement, furthermore, indicate
high "absolute" reliability; that is, the probabilities are 68 in 100 that
an individual's obtained IQs on the WAIS are within less than 4 points
of his true nonverbal (performance) IQ, within 3.00 of his true verbal IQ,
and within 2.6 of his true full-scale IQ.
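
The three IQ standard errors in Table 11.1 agree with the classical relation between a reliability coefficient and the standard error of measurement, SEm = SD x sqrt(1 - r). The sketch below is an illustration under one assumption that this passage does not state, namely that the IQ scales have a standard deviation of 15 points; the function and variable names are likewise illustrative.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


IQ_SD = 15.0  # assumed standard deviation of the IQ scales (not given in the text)

# Reliability coefficients for the three IQs, as listed in Table 11.1.
iq_reliabilities = {"Verbal IQ": 0.96, "Performance IQ": 0.93, "Full-Scale IQ": 0.97}

iq_sems = {
    name: round(standard_error_of_measurement(IQ_SD, r), 2)
    for name, r in iq_reliabilities.items()
}

for name, sem in iq_sems.items():
    # Roughly 68 obtained scores in 100 are expected to lie within one SEm
    # of the true score, which is the "68 in 100" statement in the text.
    print(f"{name}: SEm = {sem:.2f} IQ points")
```

Under that assumption the computed values match the tabled 3.00, 3.97, and 2.60, which suggests the manual's figures were obtained in this way.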


While the subtest reliabilities, on the whole, are not as high as those of the
three IQ ratings, they are, however, with three exceptions, high enough
to warrant considerable confidence in their results, since eight of the
eleven reliabilities are .79 or higher. In making differential diagnoses,
based upon patterns of subtest scores, the specific reliability indexes
(coefficients and standard errors of measurement) must be taken into
account.


probably better. 


Wechsler's manual reports only meager data on reliability. Fifty-two
individuals were retested at intervals of one month to one year, with the
results shown in Table 11.2.


TABLE 11.2

RETEST CORRELATIONS FOR FORM I

Ages     N     Rho *    PE
10-13    32    .94     .013
20-34    20    .94     .018

* Rank order correlation coefficient.
Source: Wechsler (73, p. 133).


A coefficient of .79 is not an established cut-off point; but it is practically .80; and .80
is a reasonable reliability coefficient for a subtest.


; i A he coefficients in Table 11.1 are 
superior, with two exceptions, to those obtained for Form I 


» Pressing, and persistent prob- 
subjected the instru- 


— 
TABLE 11.3

TEST-RETEST RELIABILITY OF FORM I REPORTED
FOR ABNORMAL GROUPS
(7 STUDIES)

Subtest                  Range of coefficients
Information              .56-.99
Comprehension            .12-.78
Digit Span               .59-.77
Arithmetic               .68-.87
Similarities             .38-.93
Vocabulary               .90-.93
Picture Arrangement      .49-.86
Picture Completion       .82-.89
Block Design             .65-.87
Object Assembly          .31-.79
Digit Symbol             .34-.91
Verbal IQ                .76-.91
Nonverbal IQ             .52-.94
Full-Scale IQ            .55-.90



appreciably below that of the scale as a whole; and (2) that in all but
one of the reports (in which the subjects were schizophrenics and r = .55)
full-scale reliability appears to be reasonably satisfactory, considering the
instability of the groups used. (Other r values were .87, .84, .84, .87,
.89, .90.)
From published reports, it appears that only one investigation, using
a fairly adequate sampling of individuals, has been devoted to reliability
of the Bellevue when administered to normal subjects (23). The test-
retest method was employed. The age range was 20 to approximately
50 years. One group of sixty subjects was retested after a one-week interval;
another group of sixty persons after a four-week interval; a third
group of thirty-eight subjects after a six-month period. The major findings
were the following:

The mean score for every subtest and for the total scale increased for
all three groups.

Increases in scores tend to be somewhat smaller as the retest interval
is increased.

The smallest average increase was 0.3 point in weighted score for comprehension
retest (after four weeks).

The largest average increase was 2.8 points in weighted score for picture-
arrangement retest (after one week).

Largest average increases (2 or more points in weighted score) were found
for picture arrangement and object assembly.

Smallest average increases (less than one point in weighted score) were
found for information, comprehension, and similarities.

Average changes in IQs were: verbal scale, 4.4 points; nonverbal IQ,
9.1 points; full-scale IQ, 7.6 points.


Retest correlations and standard errors of measurement for all subtests 
and the three IQs are shown in Table 11.4. It will be noted that for the 


TABLE 11.4

TEST-RETEST CORRELATIONS AND STANDARD ERRORS OF
MEASUREMENT FOR FORM I
(N = 158)

Subtests                 Correlations    SE meas.
Information                  .86           .68
Comprehension                .74          1.21
Digit Span                   .67          1.68
Arithmetic                   .62          2.06
Similarities                 .71          1.22
Vocabulary                   .88           .43
Picture Arrangement          .64          1.82
Picture Completion           .83           .95
Block Design                 .84          1.10
Object Assembly              .69          1.31
Digit Symbol                 .80          1.06
Verbal IQ                    .84          3.96
Nonverbal IQ                 .86          4.49
Full-Scale IQ                .90          3.29


subtests, four of the coefficie 
are in the .7os (low reliabilit 
reliability for a subtest), 
Comparison of the st 
Form I, with those in Table I1 


the two versions. In the 
ment are slightly smaller in seven 
subtests, and slightly larger in four. Also, this same index is smaller for
the three IQs of the WAIS. There are, furthermore, quite significant
differences between the reliability coefficients in arithmetical problems,
similarities, digit symbols, and the three IQs. Although, on the whole,
the statistics favor the WAIS, one precaution should be noted in comparing
and interpreting them. It is that the data for Form I are derived
from test-retest scores, and the data for the WAIS are from split-half
reliabilities; the latter are expected to be higher and more favorable
than the former (see Chapter 4).
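
Split-half coefficients are customarily obtained by correlating scores on two halves of a test and stepping the result up to full length with the Spearman-Brown formula, r_full = 2 r_half / (1 + r_half). They reflect consistency within a single sitting, not stability over time, which is one reason they are not directly comparable with retest coefficients. Whether and how the WAIS manual applied this correction is not stated in this passage, so the sketch below is only a general illustration; the half-test correlations in it are hypothetical.

```python
def spearman_brown(half_test_r: float, length_factor: float = 2.0) -> float:
    """Estimate the reliability of a test lengthened by `length_factor`.

    With the default factor of 2 this is the usual split-half correction.
    """
    if not -1.0 < half_test_r <= 1.0:
        raise ValueError("correlation must be greater than -1 and at most 1")
    return (length_factor * half_test_r) / (1.0 + (length_factor - 1.0) * half_test_r)


# Hypothetical correlations between odd-item and even-item half scores.
for half_r in (0.60, 0.80, 0.90):
    full_r = spearman_brown(half_r)
    print(f"half-test r = {half_r:.2f} -> estimated full-test r = {full_r:.3f}")
```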

Validity: INTERCORRELATIONS OF SUBTESTS. These correlations
are always necessary to provide data regarding the presence or absence of a
g factor. For this scale, appreciable intercorrelations would be required as
one aspect of their validity, that is, construct validity.13 The following
coefficients are reported for a group of 300 persons, of ages 25-34 (73,
p. 100).


Range is from .30 (Object Assembly with Digit Span) to .81 (Vocabulary with
Information).

The range of the highest three fourths of the coefficients is from .46 to .81.

The modal interval is .50-.59.


For a wide range of ages from 18 to 75+ years, classified into seven categories,
the median intertest correlations were as follows:


verbal: .59 to .66

performance: .53 to .59

verbal with performance: .41 to .54

all tests: .46 to .57


The preceding correlation coefficients indicate that each of the several
subtests has much in common with every other one regarding demands
upon the same or similar mental operations, though in varying degrees.

More significant, however, are the coefficients found between scores
of each subtest correlated with total scores of all other parts of the scale.14
These coefficients are as follows (74, p. 99):


Range is from .46 (Picture Arrangement) to .84 (Information).
Range of the highest three-fourths is from .65 to .84.
The modal interval is from .70 to .79.


13 Some writers state this aspect is evidence of the test's content validity. A distinction
between the two, construct and content, is not always possible. It will be recalled that
construct validity has been characterized as "sophisticated content validity."

14 It is obvious that if scores of a given subtest were correlated with total scores that
include those of the subtest in question, the resulting coefficient would be in part a self-
correlation, and spuriously high.
These correlations indicate that the WAIS, on the whole, has a reasonably
satisfactory degree of construct validity. The findings in regard to
the factorial content of the scale, already presented, provide further 
evidence on its construct validity. 
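
The reason for correlating each subtest with the total of the other parts, and not with a total that includes the subtest itself, can be shown with a small numerical sketch. The scores below are hypothetical and the helper names are illustrative; the point is only that the uncorrected part-whole coefficient comes out higher than the corrected one, because part of it is a self-correlation.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def part_whole_correlations(scores, subtest):
    """Return (uncorrected, corrected) correlations for one subtest.

    uncorrected: subtest against the total that includes the subtest itself
    corrected:   subtest against the total of all the other subtests
    """
    part = scores[subtest]
    totals = [sum(column) for column in zip(*scores.values())]
    remainder = [total - own for total, own in zip(totals, part)]
    return pearson_r(part, totals), pearson_r(part, remainder)


# Hypothetical scaled scores of six subjects on three subtests.
scores = {
    "Information": [10, 12, 8, 14, 9, 11],
    "Similarities": [11, 10, 9, 13, 10, 12],
    "Block Design": [9, 13, 10, 12, 8, 11],
}

uncorrected, corrected = part_whole_correlations(scores, "Information")
print(f"with own score in total: r = {uncorrected:.3f}")
print(f"with own score removed:  r = {corrected:.3f}")
```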

CORRELATION WITH SCHOOLING. Beginning with Binet, the amount of
schooling and quality of educational achievement were used as criteria
of validity. Hence, the ratings on this scale were correlated with years
of schooling for three age groups. These are 18-19 years (N = 200);
25-34 (N = 300); 45-54 (N = 300). The coefficients were, respectively:


verbal score: .688, .658, .718
performance score: .597, .570, .614
total score: .688, .658, .718


These correlations are, on the whole, higher than those found for 


Form I; and, again, they are reasonably satisfactory for the criterion used. 


CHANGES IN SCORES WITH INCREASING AGE. The mean scores of the
verbal, performance, and full scales increase moderately and gradually
from age 16 to 29 years. Thereafter, they decline moderately and gradually
until age 64. These results are consistent with prevailing psychological
theory, except that the upper limit, in the 25-29 year age group,
is higher than that found with other tests. The standard deviations of
the scaled scores, however, are rather close throughout the entire age
range, varying from 23.6 (at 16-17 years) to 27.0 (at 45-49 years). With
the exception of the two extreme values, all SDs fall between 24 and 26
points (see Fig. 11.3).


[Figure: mean full-scale scaled scores plotted against age, from 16 to 75 and over.]

FIG. 11.3. Changes with age in full-scale scores of the Wechsler Adult Intelligence
Scale. Ages 16-75 and over.

It is desirable to know the variability of scores not only in terms of
standard deviations but also in terms of coefficients of variation, showing
the ratio of the SD to mean (SD/M x 100), that is, the relative variation
of scores at each of the several age levels. These indexes vary from 21.91
at years 25-29, to 28.3 at years 55-59. This index is nearly uniform from 
age 16 to 44, since it is between 21.91 and 23.57; but after age 44 it ranges 
from 25.76 to 28.30. The reasons for these differences are not established. 
They might be defects in the test itself, cumulative effects of specializa- 
tion of interests in adulthood, or differential effects of advancing age 
upon mental operations. 
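The coefficient of variation is simple to compute, as the following minimal sketch shows. The means and standard deviations used here are hypothetical, chosen only to show how a nearly constant SD combined with a declining mean produces a rising index; they are not the WAIS norms.

```python
def coefficient_of_variation(sd, mean):
    """Return the SD expressed as a percentage of the mean (SD/M x 100)."""
    return sd / mean * 100


# Hypothetical age groups with the same SD but different mean scores.
younger = coefficient_of_variation(sd=25.0, mean=110.0)
older = coefficient_of_variation(sd=25.0, mean=90.0)

if __name__ == "__main__":
    print(f"younger group: {younger:.2f}")
    print(f"older group:   {older:.2f}")
```

Because the index is a ratio, a rise after age 44 need not mean that the absolute spread of scores has grown; a falling mean with a nearly constant SD would have the same effect.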

RANGE AND DISTRIBUTION OF IQs. It will be recalled that two of the
characteristics of a valid psychological test are that it shall yield a
satisfactorily wide range of scores to encompass the large differences in human
abilities, and that this distribution shall be a close approximation to
the normal (Gaussian) curve. Figure 11.4 represents the WAIS distribution
of IQs for 2052 cases between the ages of 16 and 75 years. The range
is from approximately 45 IQ to 155 IQ. This range is wide enough to
include almost all cases; but a more comprehensive sampling of the
population should yield a very small fraction of one percent of IQs above
155, and a very small fraction of one percent below 45. This distribution,
however, does establish that the scale is able to differentiate among most
persons, that it measures continuously, and that it does not yield an
undue concentration of scores at any of the levels.15

[Figure: number of cases plotted against intelligence quotients, IQ 45 to 150.]

FIG. 11.4. Distribution of Wechsler Adult Intelligence Scale intelligence
quotients. Ages 16-75 and over (2052 cases).

CORRELATIONS WITH THE STANFORD-BINET AND OTHER SCALES. In validating
a new scale for testing intelligence, it is common practice to
correlate the ratings on the new instrument with those obtained for the
same individuals on the Stanford-Binet. This practice is tantamount to
accepting the Stanford-Binet as a reasonably valid scale; that is, as one
sound criterion of validity against which to evaluate the new scale.


ports a coefficient of .82 (PE = .026) between the W-B and the S-B, for
S of 14 and 16 years. Six correlation 
studies of these two instruments, made by others, yielded coefficients 
.93 (PE = .01). Coefficients of partial
held constant, are not given; but it 
of these studies is such as to have 


... (1955), only one correlational study ... (74, p. 105). Fifty-two inmates
of a reformatory, between the ages of 16 and 26, were tested with both
scales, with the following results:


S-B with full-scale IQ = .8
S-B with verbal-scale IQ = .80
S-B with performance-scale IQ = .69


... verbal test; see Chapter 15) for comparative purposes. The following ...


Matrices with full-scale IQ = .7»
Matrices with verbal-scale IQ = .58
Matrices with performance-scale IQ = .70


15 t H H 2 "
a fice eae ined DI ssian) S, Psychologists that the distribution of IQs
view. He apparently interprets the curve in Ii. seeps) pem he does not hare Ga
the lower end. This curve is, however, a fairly good a ‘4 as being somewhat skewe


Table 11.5 summarizes the correlations found between the S-B and the
W-B (1939) scales. Although there have been surprisingly few studies com- 
paring the two Wechsler scales and the Stanford-Binet, the available data 
show that there is substantial correlation between them, especially be- 
tween full-scale IQs and those of the S-B, and verbal-scale IQs and those 
of the S-B. On the other hand, as would be expected, correlations be- 
tween Stanford-Binet intelligence quotients and the performance-scale 
IQs are only moderate. 

It is unfortunate, too, that most of the studies comparing the two in- 
struments have used atypical groups, such as hospital patients, clinic 
referrals, and prison inmates, to the neglect of individuals who fall within 
the categories of normal behavior and adjustment. 


TABLE 11.5


COEFFICIENTS OF CORRELATION BETWEEN THE STANFORD-BINET
AND FORM I (FREQUENCIES)


With With With
r full-scale IQ verbal IQ performance IQ
.90 5 2
.85 2 2
.80 1 1 1
.75 2 1
.70 1
.65 1
.60 1 1
.55 1
.50 2
.35 1


Since the WAIS is regarded as superior to Form I, it is reasonable to 
conclude that in comparisons with the S-B, the correlation coefficients 
found for the latter would be smaller than those found for the former. 

Correlation coefficients indicate degree of relative agreement of paired
scores, but not the absolute differences between them. It is necessary,
therefore, to know the extent of IQ differences between the correlated
values. On the whole, it has been found that the differences are not
large, although large discrepancies will occasionally appear at the two
extremes of the distribution. In studies of retarded and mentally deficient
individuals (say, the lowest decile group), the Wechsler scales
yield somewhat higher IQs than the S-B. At the upper level of mental
ability, however (say, the highest decile group), the Stanford-Binet yields
somewhat higher IQs.
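A minimal sketch may make the distinction between relative agreement and absolute difference concrete. The paired IQs below are hypothetical, constructed so that the two sets are perfectly correlated and yet differ at both extremes in the directions just described; `pearson_r` is a helper written for this illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired IQs for five persons on the two instruments.
stanford_binet = [70, 85, 100, 115, 130]
wechsler = [76, 88, 100, 112, 124]

r = pearson_r(stanford_binet, wechsler)
differences = [w - s for w, s in zip(wechsler, stanford_binet)]

if __name__ == "__main__":
    print(f"r = {r:.2f}")
    print("Wechsler minus S-B:", differences)
```

Here the rank order and relative spacing of the five persons are identical on both scales, so the coefficient is at its maximum, while the individual IQs differ by as much as six points at either extreme.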

Comparison of IQs is complicated by the fact that age of testees is also 
a factor. Taking the population samples as a whole (rather than only 
the extreme groups), the following are the general findings. (1) Within 
the age range of approximately 10 to 19 years, the Stanford-Binet IQs 
tend to be somewhat higher. (2) From age 19 to about 35, the intelligence 
quotients tend to be about the same. (3) Above the age of 35, the Bellevue 
intelligence quotients tend to be somewhat higher.16 

In the case of any given individual, therefore, comparison of Stanford- 
Binet and Bellevue IQs must take into account both ability level and 
chronological age. In a given instance, a person's age and ability level 
may be such as to increase or decrease the difference between the IQs 
obtained with the two instruments.17 

This discussion of the comparison between the S-B and Wechsler scale 
IQs is based upon results found with Form I of the latter and with the 


yielded varying results, dependi 
of the subjects. Coefficient i 


... VALIDITY. As a test of the validity of Form I, its author
applies the pragmatic criterion. He states: "How do we know that our
tests are 'good' measures of ... ways: (1) by analyzing ... groups; and
(2) by ... other intelligence test. . . . Empirical judgments, here as elsewhere,
play the role of ultimate arbiter. In any case, all evidence for the validity 
of a test, whether statistical or otherwise, is inevitably of an indirect sort 
and, in the end, cumulative rather than decisive" (74, p. 127). In other 
words, it has been found by the author of the scale, and by others, that it 
works with reasonable satisfaction in clinical practice. 

In discussing the validity of the 1955 scale, Wechsler states that it 
satisfies the following criteria: ratings by selected judges (generally teach- 
ers in the case of children); conformity with the normal growth curve 
of mental ability; and comparisons with over-all socioeconomic achieve- 
ment (especially in identifying and appraising the mentally deficient). 
Both versions of the scale, states Wechsler, meet these criteria (74, p. 109). 

The foregoing conclusion is based largely upon results found with 
Form I. For example, teachers’ estimates of their pupils’ intelligence, 
rated on a six-point scale, for a group of seventy-four adolescents in a 
trade school, correlated .52 with IQs derived from Form I. For another 
group (forty-five in number) in a general high school, the coefficient was
.48. These numbers are, unfortunately, not large enough, although they
indicate the probable trend.

In one study, two groups were differentiated on the basis of total scores:
(1) a borderline group, having IQs between 66 and 79; and (2) a mentally
defective group, having IQs between 50 and 65. The problem was to
determine whether each of the eleven subtests contributes significantly
to the differentiation of the two groups. Since the mean scores on each
of the subtests for the two groups did differentiate, and since the differences
between the means were in the directions that should be expected,
it was concluded that each subtest did contribute to over-all differentiation,
although Digit Span and Object Assembly contributed relatively
little (69).

A second study used only the verbal subtests with naval recruits as
subjects. The problem here was to learn whether these subtests distinguish
between (1) the mentally defective and the borderline, or (2) the
borderline, the dull normal, and the normal. The findings indicated that
each of the verbal subtests contributed to the differentiations between
these groups. The Digit Span subtest in this instance, however, proved
to be as effective as the others (45).

Extensive research information has not been provided on the predic- 
tive validity of the WAIS in educational guidance. It is a reasonable 
assumption, however, that this more recent instrument, an improvement 
over the first version, will be at least as effective in these aspects of test 
validity.

One comprehensive study of 161 college freshmen (55) concludes that
the WAIS may be used for educational prediction with more confidence 
than one of the well-known group tests (The American Council on Edu- 
cation Psychological Examination). In this study the following correla- 
tions with grade-point averages were found: 


WAIS verbal score .58
WAIS full score .53
WAIS performance EI


A.C.E. linguistic score .46
A.C.E. quantitative score .18


In a clinical situation, the prognostic value of a psychological test is 
dependent upon the soundness of the clinical diagnoses with which the 
test's findings are compared. This fact presents a major problem; for 
classification into clinical categories is difficult, often unreliable, and 
subject to the clinician’s theoretical orientation. The major exception 
is the diagnosis of mental deficiency, for the determination of which a 
sound individual test of general intelligence, when administered and 
interpreted by a qualified psychologist, is the most valid single instru- 
ment. For this purpose, the Stanford-Binet and Wechsler scales have
proved to be most valuable. A diagnosis of mental deficiency, of course,
is ordinarily not made upon the basis of IQ and MA alone, although
in some instances, when the background of an individual is known, the
findings with a single test are sufficient because they are clear and
unequivocal.


ae ny anges in par 
uous, Muc ; ab- 
ulary subtest. In the case a the same is true ofthe yop 


, 


(Object Assembly) With such 
_ test items, if this populatio? 
ifferentiate well among individ-

Scoring and IQ Calculation


Scoring. All parts of this scale are scored on a point basis. 
For some subtests, the earned raw score is simply the number correct, 
each item being scored either plus or minus (for example, Information). 
In the subtests of Comprehension or Similarities, the score for each item 
is 0, 1, or 2, depending upon quality of the response. In other parts, as 
in Arithmetical Reasoning or Block Design, the earned raw score is based 
not only upon correct responses, but upon the time taken to solve the 
problem. Thus, the factor of speed of performance is involved in sections 
of this scale, especially in nonverbal subtests. 

The raw score for each subtest is first obtained by addition of the 
credits on the items in that part. This raw score is converted into a 
weighted score (a type of standard score), by means of a conversion 
table. The purpose of this conversion is the customary one of placing 
all subtest scores on a comparable basis. The weighted scores for all 
parts of the scale are added to obtain the full score upon which the
full-scale IQ is based. Also, the weighted scores of only the six verbal parts
are added to get the verbal score, upon which the verbal-scale IQ is
based. Similarly, the weighted scores of the five performance tests are
added to get the performance score and performance-scale IQ.

The following well-known formula was used for converting each subtest's
raw scores into weighted scores.

$$X_2 = M_2 + \frac{SD_2}{SD_1}\,(X_1 - M_1)$$

$M_2$ = an arbitrarily assigned mean of 10

$SD_2$ = an arbitrarily assigned standard deviation of 3

$X_2$ = the weighted score to be found

$M_1$ = the mean of the subtest's raw scores

$X_1$ = the particular raw score to be converted to a weighted score

$SD_1$ = the standard deviation of the subtest's raw scores.


This formula (1) assigns an arbitrary and uniform mean score to all
subtests; (2) multiplies each individual score's deviation from its mean by
a constant ratio; (3) adds the result to, or subtracts it from, the assigned
mean.18 By using this formula, scores are so converted that each individual
maintains his relative status on each subtest. And in the case of any
given person's subtest scores, differences between scores will be attributable,
theoretically, to differences in his performance level rather than
to differences in the weighting of each subtest in the total. It is thus
possible to vary the number of items in each of the several subtests without
giving any of them unequal weight in the total score upon which
the IQ is based.

18 The best way for the student to see how this formula works is to substitute several
sets of values in it and to observe the outcome. The logic of the process will then be more
readily apparent. For example, if the two following sets of values are substituted in the
formula, the process will be clear. Assume the following data for one subtest: mean
score = 12, SD = 4, X = 15; while for a second subtest the corresponding values are
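The conversion of raw scores to weighted scores can be sketched in a few lines. The function name `weighted_score` is introduced only for this illustration. The first set of values (mean 12, SD 4, raw score 15) is the one given in the footnote; the second set is hypothetical, chosen so that the raw score lies the same number of standard deviations above its mean.

```python
def weighted_score(raw, raw_mean, raw_sd, new_mean=10.0, new_sd=3.0):
    """Convert a raw score to the weighted-score scale (mean 10, SD 3).

    The raw score's deviation from its own mean is rescaled by the
    ratio of the assigned SD to the raw-score SD, then added to the
    assigned mean.
    """
    return new_mean + (new_sd / raw_sd) * (raw - raw_mean)


# Values from the footnote: mean 12, SD 4, raw score 15.
first = weighted_score(15, raw_mean=12, raw_sd=4)

# Hypothetical second subtest: mean 30, SD 10, raw score 37.5,
# which is also three-fourths of an SD above its mean.
second = weighted_score(37.5, raw_mean=30, raw_sd=10)

if __name__ == "__main__":
    print(first, second)
```

Both raw scores stand three-fourths of a standard deviation above their respective means, and both convert to the same weighted score, although the raw-score units of the two subtests are quite different.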

The reason for converting raw scores into weighted scores is that the 
possible maximum raw scores vary in the several subtests of the scale. 
If, therefore, the raw scores were simply added to obtain an individual's
rating on the scale, each of the parts would carry a different weight in
the total; each part would have the possibility of contributing differently
to the final result, some more heavily than others. The raw-score units
of one part of the scale would not have the same significance as those of
other parts. If this were the case, then implicit in the scoring would be
the assumption that certain of the psychological functions being tested
should be regarded as more important than others in obtaining the total
score and in deriving an index of intelligence. The W-B scale, however,


could be derived. Thus, an individual 
full scale earns an IQ of 100; 
below the mean earns an IQ of 85; 
of 70, and so on, above the mean as
ard percentile distribution tables, it is possible to determine the percentile
rank that corresponds to any deviation IQ. Thus, an IQ of 115, being
one SD above the mean, corresponds to a percentile rank of 84.19 (See
Chapter 6 for a discussion of relative scores.)

In using a deviation IQ, the principle adopted is that an individual's 
intelligence quotient should indicate the relative extent to which his 
scaled score deviates from the mean of his own age group. 
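The correspondence between a deviation IQ and a percentile rank can be sketched as follows, assuming that IQs are normally distributed with a mean of 100 and a standard deviation of 15. The function name `percentile_rank` is introduced only for this illustration; `NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist


def percentile_rank(iq, mean=100.0, sd=15.0):
    """Percentage of a normal distribution falling below the given IQ."""
    return NormalDist(mean, sd).cdf(iq) * 100


if __name__ == "__main__":
    for iq in (70, 85, 100, 115, 130):
        print(iq, round(percentile_rank(iq)))
```

Under these assumptions an IQ of 115, one SD above the mean, falls at about the 84th percentile, and an IQ of 85, one SD below the mean, at about the 16th.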

COMMENTS ON THE Deviation IQ. The conventional, or ratio, IQ 
(MA/CA) relates the test score of an adult to the performance level of 
an average group at a specified maximum age. The 1916 S-B placed this 
age at 16; the 1937, at 15. The deviation method, at all ages, relates an 
individual's score to the average (mean) performance of his own age 
group. This method obviates the necessity of determining the age of 
"average adult MA"—always a difficult problem and as yet uncertain. 
(The W-B, Form I, and the WAIS do not use mental age.) 

One problem presented by a deviation IQ—and probably a weakness 
of it—is that the same, or constant, objective performance on the test 
will give an individual a higher rating with increased age after the 
maximum level has been reached, since his constant score will be com- 
pared with declining age norms. 

Norms of the WAIS show a moderate but steady decline after the age
interval 25-34. Obviously, after the period of decline sets in (however
moderate), as shown by the test scores, an individual's IQ will decline 
gradually if the ratio IQ is used; whereas, using a deviation IQ, his rating 
will decline only if his losses are greater than the average losses of his 
own age group. If his losses are less than the average rate, his deviation 
IQ will rise.20 
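The effect of declining age norms on a constant performance can be shown with a minimal sketch. The age-group means and standard deviations below are hypothetical, not the WAIS norms, and the function `deviation_iq` is introduced only for this illustration; it places a score on a scale with a mean of 100 and an SD of 15 relative to the person's own age group.

```python
def deviation_iq(score, age_mean, age_sd, iq_mean=100.0, iq_sd=15.0):
    """Express a scaled score as a deviation IQ for one age group."""
    return iq_mean + iq_sd * (score - age_mean) / age_sd


# Hypothetical norms: (mean scaled score, SD) for three age groups.
HYPOTHETICAL_NORMS = {
    "25-34": (110.0, 25.0),
    "45-54": (100.0, 25.0),
    "65-69": (90.0, 25.0),
}

# The same objective performance at every age.
constant_score = 110.0

iqs = {
    age: deviation_iq(constant_score, mean, sd)
    for age, (mean, sd) in HYPOTHETICAL_NORMS.items()
}

if __name__ == "__main__":
    for age, iq in iqs.items():
        print(age, iq)
```

With these assumed norms, an unchanged score yields a deviation IQ that rises from one age group to the next, because the comparison group's mean is falling while the individual's performance is not.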


Deterioration and Scatter 


This scale provides a scheme for calculating a “deterioration
quotient,” based on the premise that certain types of tested mental processes
decline more rapidly than do other types; and that the difference
between rates of decline, as between these two types, in the case of any
given person, indicates his relative degree of deterioration. In other
words, there are certain tested functions that hold up with age and others
that do not. This index will be explained in more detail in a later
chapter, together with other tests devised to measure deterioration of
mental abilities.


19 Although the standard deviation of the Stanford-Binet is 16, the difference between
it and the WAIS SD of 15 does not have practical significance.

20 The reader should note that we have emphasized decline in test score. This does not
necessarily mean that on the whole a person becomes progressively less intelligent
before the effects of senescence become apparent. While it is true that there is some loss
in average test scores after about age twenty-five, it is also true that some mental traits,
as yet unmeasured by intelligence tests, increase in effectiveness through an extended
period of adulthood and more than compensate for losses in the processes measured by
current scales. This view is borne out by the facts regarding ages of maximum achievement
of scholars, scientists, writers, and artists.

A second feature of the scale is its emphasis upon "scatter analysis";
that is, analysis of an individual's performance on the several parts of
the scale for the purpose of facilitating clinical analysis of the subject's
performance. Such analysis may lead to diagnostic inferences concerning
personality characteristics and behavior disorders owing to organic brain
disease, psychosis, psychoneurosis, adolescent psychopathy, and mental
deficiency. Here again, this application of the scale will be presented in
a subsequent chapter on clinical uses and interpretations of tests.

Short Scales. As in the case of the Stanford-Binet, the use of
short scales has been suggested by some psychologists. Several have been
proposed (12, 24). A short scale is one from which some of the subtests
have been omitted in their entirety, and the scores are prorated in order
to make them comparable to scores of the full scale for the purpose of
obtaining an IQ. The use of a short scale has been proposed as a
time-saver.

Although the correlation ... scale IQs have been consistently ...
tests (usually in the .90s) ... version is not advisable
except for rough screening. If decisions are to be made affecting
individuals, ... practice. Furthermore, each ... pattern of performance
which is ... intelligence provides an opportunity to
make valuable qualitative observations of the behavior of the testee
... Thus, reducing the number of
subtests reduces opportunities for making significant observations.


: Northeast, North Cen- 
er, 1700, was evenly divided be- 
are consiste i d b 
ie Census Bureau. In regard to urban. are maad g
tion group conforms closely to the percentages reported in the United
States census of 1950. In view of data already presented on the several 
aspects of the scale's internal validity, it appears that this group of 1700 
is an adequate one; although the scale's predictive validity has not yet 
been demonstrated. 

Are the subtests a variety of disconnected types? In view of the find- 
ings already presented in this chapter, under the heading of Factorial 
Composition, the answer to this question must be distinctly in the nega- 
tive. This answer is supported by the nonstatistical analysis of functions 
involved in each of the subtests, also discussed in an earlier section of this 
chapter. 

Are the verbal subtests culturally unfair to some persons? The answer 
to this question is the same as that given for the Stanford-Binet. In this 
scale, as in the S-B, the verbal materials under Comprehension, Simi- 
larities, and Arithmetic are stated in terms that place little premium upon 
educational or other cultural advantages. As in all tests of information 
and vocabulary, success on these parts is dependent, in part, upon
opportunity to learn, whether in school, home, or through one's own
exploitation of all aspects of the environment.

Are some of the test items obsolete? Items in all tests must be reviewed 
periodically for possible obsolescence. In the case of this scale, a number 
of the items, especially Comprehension, Information, Vocabulary, Picture 
Arrangement, and Picture Completion, should be re-examined and re- 
evaluated periodically. Also, the satisfactory, partially satisfactory, and 
unsatisfactory responses for a number of items should be reviewed and re- 
vised in the light of responses that have been obtained since the scale's 
publication.

Is the factor of speed important? Unlike the Stanford-Binet, in which
very few test items are timed, some WAIS scores are significantly affected
by the speed factor. Speed of performance yields additional credits in the
subtests that follow: Arithmetic, Picture Arrangement, Object Assembly,
Digit Symbol, and Block Design. Thus, in the total score, speed of work
is combined with power (or ability level). Although in general, speed and
power are highly correlated, it is also a fact that response-time slows down
with age. Thus, since this scale is designed for adults, it might measure,
to an important degree, decline in speed of response and not necessarily
decline in power, particularly in later adult years. This caution must be
kept in mind when, in a later chapter, we consider the “decline” of
abilities and the suggested use of a “deterioration” index.

Do the nonverbal subtests involve visual acuity? Although no experimental
data are available in answer to this question, several users of the
scale have observed that visual acuity might be a factor in some instances.
The subtests most likely to make some demands upon visual
acuity are Picture Arrangement and Picture Completion. And, of course,
color blindness must be considered as a factor in the Block Design subtest.

Are the reliability coefficients satisfactory? Available data indicate that 
total verbal scores and total performance scores have a satisfactory degree 
of split-half reliability, as do full-scale scores. While the split-half
reliabilities of some of the subtests are high, some are only fair, or low. This
fact must be taken into account when response patterns are being used 
for differential diagnosis or for evaluating deterioration of mental ability. 
Absence of test-retest reliability data is a significant deficiency. 

Should the verbal and the performance scores be combined? The tables 
of norms show that the maxim 
year age group, while the maximum performance norm is reached by the 
20-24 year group. Some have questioned the practice of combining both 


examiner will take note of 
performance IQ in any insta 
Is the scale clinically useful? ...
disorders has yet to be unequivocally demonstrated. In spite of the fact
that this is the area in which most of the evaluative studies of the scale 
have been made, clinical findings are not definitive. 

In attempting to diagnose personality and behavior disorders on the 
basis of the pattern or profile of scores on the Bellevue or other scales, it 
must be remembered also that different educational backgrounds and 
cultural factors, quite unrelated to personality and behavior disorders, 
could account to some degree for an individual's inconsistency of per- 
formance on the several parts of the scale. It has been found, too, that 
individual variations in interests find expression in different patterns of 
mental activities and might be reflected in subtest variations. However, 
the results obtained by means of this scale, plus clinical experience and 
acumen, provide a valuable combination for study of individual
differences and individual mental functioning.


The Intelligence Scale for Children (1949)


Description. This scale for children from 5 through 15 years
of age (WISC) is developed on the same principles and in the same form
as the WAIS: verbal subtests, performance subtests, a verbal IQ, a
performance IQ, and a full-scale IQ.

The subtest types are identical with those of the older scale, with the
exceptions that follow: Digit Span is made optional; an optional maze
test has been added; and in place of the Digit Symbol test, a coding test
has been substituted, in which various lines in varied positions (single,
double, circle) are associated with geometric figures (star, circle, triangle,
cross, rectangle).

Standardization Population. The scale was standardized on a 
sample of one hundred boys and one hundred girls at each of the eleven 
age levels, each child being tested within one and one-half months of his
midyear.

Selection of the 2200 children was based upon (1) rural-urban residence;
(2) father’s occupation; and (3) geographic area. The proportions
in these sampling factors were based upon U.S. census data for 1940,
“. . . with some adjustment for the shift of population toward the
West.” In the final selection of the standardization sample, geographic
area percentages are reasonably well satisfied; urban-rural percentages, 
less well; and father’s occupation percentages, moderately. 

Reliability. Split-half coefficients were found for three age
groups (7½, 10½, 13½), 200 in each. The findings are summarized in
Tables 11.6 and 11.7. It will be noted, from these data, that the subtest
reliability coefficients vary markedly, and are, on the whole, only moderate
in size. The IQ reliabilities, however, ranging from .86 to .96, fall within
the range that is generally acceptable. These data demonstrate again the
necessity of distinguishing between reliability of part of a scale and
reliability of the whole scale.


TABLE 11.6


RELIABILITY DATA: INTELLIGENCE SCALE FOR CHILDREN

Subtest reliabilities

Age group   Range of r's   Mean   High r         Low r
7½          .59-.84        .67    Block Design   Comprehension, Completion
10½         .59-.91        .76    Vocabulary     Digit Span
13½         .50-.90        .75    Vocabulary     Digit Span

IQ reliabilities

Age group   Verbal   Nonverbal   Full
7½          .88      .86         ...
10½         .96      .89         ...
13½         .96      .90         ...


(Digit Span, Coding, and Mazes are not included.)


SOURCE: D. Wechsler (71). By permission.


The standard error of m
within which the chances ar
true score will fall in tha.
of 1.20 for 7½-year-olds
probabilities are two to o
subtest is within 1.20 poi
standard error of 4.25 IQ
the probabilities are abo
on this scale is within 4.

The split-half reliabili
basis of the standardization data, it appears that considerably more
confidence can be placed in those indexes than in the scores of the
individual subtests (with the exception of Vocabulary). And because there are
marked differences among reliability coefficients of the subtests for each 
of the three age groups, it is highly desirable that separate reliability 
studies be made for each of the eleven age groups, especially at the ex- 
tremes of the age distribution for which the scale is intended. 


TABLE 11.7


STANDARD ERRORS OF MEASUREMENT: INTELLIGENCE SCALE
FOR CHILDREN

Subtest standard errors *

Age group   Range       Mean   High         Low
7½          1.20-2.45   1.74   Digit Span   Picture Arrangement
10½         .90-1.92    1.44   Digit Span   Vocabulary
13½         .95-2.12    1.47   Digit Span   Vocabulary

IQ standard errors †

Age group   Verbal   Nonverbal   Full
7½          5.19     5.61        4.25
10½         3.00     4.98        3.36
13½         3.00     4.74        3.68


* Standard errors of measurement of subtests are given in units of the weighted scores.
† Standard errors of intelligence quotients are given, of course, in IQ points.
SOURCE: Wechsler (71, p. 13). By permission.
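The relation between the reliability coefficients of Table 11.6 and the standard errors of Table 11.7 can be checked with a minimal sketch. It assumes the customary formula, SEM = SD x sqrt(1 - r), and an IQ standard deviation of 15; neither assumption is stated in the surviving text, and the function name is introduced only for this illustration.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """Customary formula: SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1.0 - reliability)


IQ_SD = 15.0  # assumed standard deviation of the IQ scale

# Verbal and nonverbal IQ reliabilities at 7½, 10½, and 13½ years (Table 11.6).
verbal_r = [0.88, 0.96, 0.96]
nonverbal_r = [0.86, 0.89, 0.90]

verbal_sem = [standard_error_of_measurement(IQ_SD, r) for r in verbal_r]
nonverbal_sem = [standard_error_of_measurement(IQ_SD, r) for r in nonverbal_r]

if __name__ == "__main__":
    print([round(s, 2) for s in verbal_sem])
    print([round(s, 2) for s in nonverbal_sem])
```

Under these assumptions the computed values agree with the verbal and nonverbal IQ standard errors in Table 11.7 to within about one hundredth of an IQ point, the small discrepancies presumably arising from rounding of the published coefficients.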


In a study of the test-retest reliability of this scale, sixty children were
first examined in the fifth grade; after four years they were re-examined
in the ninth (82). The following correlations were obtained:
performance-scale IQ, .74; verbal-scale IQ, .77; full-scale IQ, .77.

Validity. Subtest Intercorrelations. In the manual for this
scale, there are no data on the problem of validity as such. There are data
on intercorrelations of the subtests. The assumption is that significant
correlations between subtests would validate the hypothesis that
the subtests and the scale as a whole measure common factors. However, the
correlation coefficients among the individual subtests are, on the
whole, not as high as would be expected. At the 7½-year level, these
coefficients are concentrated within the .20s and .30s; at the 10½-year level,
they are concentrated within the .30s and .40s; while at the 13½-year
level, they are distributed within the .20s, .30s, and .40s.

On the other hand, each verbal subtest correlates quite significantly
with total verbal score, the range for the three age groups being from .44
to .82, with the coefficients fairly evenly distributed. The nonverbal subtests
correlate somewhat lower with total performance scores, the range
being from .32 to .68, with some concentration in the .50s.

The correlation coefficients between total verbal scores and total
performance scores are, respectively, .60, .68, and .56 for these same age
groups.

These findings indicate that, on the whole, although each subtest has only
a moderate amount of communality with the others taken singly, verbal
subtests combined have much more communality with each individual verbal
subtest.21 The same is true of combined performance and separate
performance scores.


TABLE 11.8

CORRELATIONS BETWEEN THE INTELLIGENCE SCALE FOR CHILDREN AND OTHER SCALES
(5 STUDIES)

Other scale          Subjects             N       r
Arthur point scale   mentally defective   40      .79 (Full scale)
  "                    "                  40      .88 (Nonverbal scale)
  "                    "                  40      .47 (Verbal scale)
Stanford-Binet, L      "                  40      .76 (Full scale)
  "                    "                  40      .64 (Nonverbal scale)
  "                    "                  40      .75 (Verbal scale)
Stanford-Binet       subnormals           70      .68 (Full scale)
  "                    "                  70      .69 (Verbal scale)
Stanford-Binet       normals              49-53   .85 (Full scale)
  "                    "                  49-53   .82 (Verbal scale)
  "                    "                  49-53   .80 (Nonverbal scale)
[illegible]            "                  49-53   .80 (Full scale)
  "                    "                  49-53   .77 (Verbal scale)
Stanford-Binet, L      "                  49-53   .81 (Nonverbal scale)
  "                    "                  54      .80 (Full scale)
  "                    "                  54      .71 (Verbal scale)
  "                    "                  54      .63 (Nonverbal scale)
  "                    "                  332     .82 (Full scale)
  "                    "                  332     .74 (Verbal scale)
  "                    "                  332     .64 (Nonverbal scale)

* Corrections are made to eliminate ...


Finally, the data indicate that all the verbal subtests taken as a whole
have considerable communality with all the performance subtests as a
whole. Yet, since the aforementioned coefficients of .60, .68, and .56 are
fairly distant from unity, the measured abilities in one group (verbal) can
be used only for a general approximation of abilities measured by the
other group of subtests (nonverbal), and vice versa. The reporting,
therefore, of verbal, nonverbal, and full-scale IQs with this instrument is
essential.
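
How rough the "general approximation" is can be illustrated with the
standard error of estimate, SD x sqrt(1 - r^2), which gives the typical
error made when one score is predicted from another. The sketch below is
only an illustration under assumptions not stated in the text: an IQ
standard deviation of 15 and linear prediction; the function names are
hypothetical.

```python
import math

def standard_error_of_estimate(sd_predicted, r):
    """Typical error in predicting one score from another: SD * sqrt(1 - r^2)."""
    return sd_predicted * math.sqrt(1.0 - r * r)

def shared_variance(r):
    """Proportion of variance the two scores have in common: r^2."""
    return r * r

IQ_SD = 15.0  # assumed standard deviation of each IQ scale

# Verbal-performance correlations reported for the three age groups.
for r in (0.60, 0.68, 0.56):
    print(r,
          round(shared_variance(r), 2),
          round(standard_error_of_estimate(IQ_SD, r), 1))
```

With r = .60, for example, the two scales share about 36 per cent of their
variance, and a performance IQ predicted from a verbal IQ would carry a
standard error of about 12 points under these assumptions.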

Correlations with Other Scales. Since the appearance of this scale,
several reports have been published that deal with the correlations and
IQ differences found between it, the S-B, and the Arthur nonverbal tests.
The summarized data are given in Tables 11.8 and 11.9.

The data in Tables 11.8 and 11.9 are for the entire group in each
instance. At different ages, the correlations between S-B and full-scale
IQs vary from .75 to .90; for the verbal scale, between .65 and .90; and
for the performance scale, between .50 and .75. The table giving mean
intelligence quotients and standard deviations indicates that the Wechsler
scale tends to rate subnormal subjects somewhat higher, but not markedly
so, than does the Stanford-Binet. At the average level, the reverse is
true.


TABLE 11.9

IQs OF INTELLIGENCE SCALE FOR CHILDREN COMPARED WITH TWO OTHER SCALES
MEANS AND STANDARD DEVIATIONS
(5 STUDIES)

WISC                  Arthur point scale   S-B         Subjects     N
 60(SD6)  Full        65(SD12)              56(SD5)    Deficients   40
 65(SD13) Verbal
 58(SD10) Perform.
 66(SD9)  Full                              65(SD7)    Subnormal    70
 67(SD7)  Verbal
 72(SD11) Perform.
100(SD15) Full        95(SD16)             105(SD15)   Normal       49-53
 99(SD14) Verbal
101(SD15) Perform.
102(SD11) Full                             106(SD11)   Normal       54
101(SD12) Verbal
104(SD11) Perform.
101(SD13) Full                             108(SD16)   Normal       332
103(SD14) Verbal
 98(SD15) Perform.


On the basis of the research thus far reported, it is reasonable to
conclude that full-scale intelligence quotients and verbal-scale
intelligence quotients, on the one hand, and Stanford-Binet IQs, on the
other, have considerable communality of psychological functions being
measured. The performance-scale intelligence quotients have much less in
common with the Stanford-Binet.

A number of other comparative studies have been made, using group tests of
both the verbal and nonverbal types. Full-scale IQs of the WISC correlated
with these as low as .61 in some instances, and as high as .91 in others
(4, 59, 67).

In each study, the coefficient must be viewed in the light of the number 
of individuals tested, their age range, and their range of ability. 

Predictive Efficiency. Validity of a test for school children, especially,
should be evaluated in terms of its predictive value in regard to school
achievement. Table 11.10 summarizes the results reported in several studies
in which WISC scores were correlated with objective achievement-test
scores.


TABLE 11.10

CORRELATIONS OF WISC SCORES WITH SCHOOL ACHIEVEMENT

Scale        N       Range of r's for            Total achievement
                     separate school subjects    test
Full         54      .45-.71                     .76
Verbal       54      .48-.60                     .62
Nonverbal    54      .41-.64                     .65
Full         18-21   .44-.81                     -
Verbal       18-21   .47-.74                     -
Nonverbal    18-21   .29-.74                     -
Full         621     .66-.67                     -
Verbal       621     .58-.62                     -
Nonverbal    621     .52-.63                     -
Full         51      -                           [illegible]
Verbal       51      -                           [illegible]
Nonverbal    51      -                           .54


On the whole, these results compare favorably with the range of
correlation coefficients found for other widely used tests of
intelligence. Additional systematic validation studies would be valuable
at each age and grade level, and for each of the several mental levels
(for example, gifted, superior, average, slow, mentally deficient).

Evaluation. This scale is a significant addition to the limited number of
instruments available for individual testing. Although one advantage
originally claimed for the WISC was that it did not use the mental-age
concept, it was subsequently found desirable to supply mental-age
equivalents (71); for this concept is a highly useful one when interpreted
by a qualified psychologist.

The relatively low reliabilities of some of the subtests indicate that
considerable caution must be used in utilizing a test profile for
diagnosis or guidance. The full scores and the total verbal and total
performance scores, however, have yielded reliability coefficients at a
satisfactorily high level.

More systematic research is needed on the scale's predictive validity for
educational purposes. Available data show that IQ differences between the
WISC and the S-B are significant enough in some instances to warrant
caution in using the two scales interchangeably in every situation.

The limits of the IQ values given by the WISC full scale are from 46 to
154. This means that the scale is not as accurate when used with
individuals who rank above or below these limits as when used with others.
The total number of such individuals is small; but in particular instances
this can be a serious limitation.


INDIVIDUAL PERFORMANCE SCALES


Definition and Need 


A performance scale is one in which language is used only in the
instructions, or not at all when directions are given in pantomime. The
task to be performed requires an overt motor response other than verbal.
The principal characteristic of a performance test is that ... response
to ... “performance scale” (or “nonverbal test”). Although the term
“performance” may be ... test items in the Wechsler ... performance ...
misleading in the scales. An individual may verbalize the problem ... to
facilitate his responses; but the ... language. Verbalization is not
required in any ...

... as revealed in the course of the examination (18). Unlike tests that
were developed subsequently and that are now in use, those of Healy and
Fernald were not actually standardized in respect to administration and
scoring. This group of tests provided the psychological examiner with
situations wherein he could observe, evaluate, and interpret the testee's
methods of solving problems and his behavior in test situations. The
specific tests were selected on the basis of Healy's and Fernald's
judgment and psychological insights as to what constituted intelligent
activity; beyond this the value of the results obtained with their tests
depended upon the clinical acumen of the examiners, since there were no
norms based upon standardization procedures.3

Performance tests have proved most valuable when used with persons
handicapped by language disabilities, such as the deaf, the
foreign-language-speaking groups, the illiterate, and those who have
speech or reading disabilities. They are valuable, also, in helping to
identify children who are inarticulate or excessively shy because of
emotional reasons and who, therefore, might appear at a disadvantage on
verbal tests of mental ability.

Used in conjunction with the verbal type, performance tests are helpful
in identifying, with increasing certainty, the mentally deficient and the
mentally retarded. In cases involving diagnosis of mental deficiency, it is
often desirable to supplement the Stanford-Binet, the Wechsler, or other
verbal scales with performance tests, in order to determine whether the
language factor, cultural handicaps, or poor education may have adversely
affected the testee's score on the types of test materials included in the
verbal scales. If a significant difference is found between the two
obtained ratings, further study of the individual is indicated.

When performance tests were first used, it seemed as though they would be
fairer and more appropriate for testing children and adolescents from
culturally underprivileged environments who, therefore, might be
handicapped in taking verbal tests.

Originally, then, individual performance scales were devised as
substitutes for the Stanford-Binet. (At present, however, the more general
view is that they should be regarded as supplements to scales employing
largely verbal and numerical materials. The principal reason for this view
is that the correlations between performance and verbal scales are only
about .50, when chronological age is held constant.)

At present, judging from published case reports from schools and clinics,
there is little research activity in this aspect of mental testing. We
shall, however, present the situation as it is currently, since
performance tests are historically significant and are being used, even
though they are the subject of research by very few psychologists.

3 With their performance tests, Healy and Fernald included tests of
reading, arithmetic, word opposites, information, and others. All tests
referred to are listed in the references at the end of the chapter.


Representative Scales 

THE PINTNER-PATERSON SCALE. This group of performance tests, the first to
be organized into a scale, is now of interest principally for its
historical and background value. Acquaintance with it will also en...
subject. (They are intended primarily for use with persons having ... for
non-English-speaking individuals.) These a... as supplements to verbal
scales an... have speech defects or reading disabilities.

The subtests in the scale are described below.

1. ... the picture-puzzle type, is ... in color. Sections of the board are
removed ... must replace them correctly. (Score is based on ... wrong
moves.)

6. Triangle Test. Four triangular pieces are to be fitted into the board.
Score is based on tim...


FIG. 12.1. Pintner-Paterson performance tests. C. H. Stoelting Company,
by permission.

8. Healy Puzzle A. This consists of five rectangular sections to be fitted
into a rectangular frame. Score is based on time required and number of
moves made.

9. Manikin Test. Wooden legs, arms, head, and body are to be put together
to make the form of a man. Score depends on quality of performance.

10. Feature Profile Test. Wooden sections have to be put together to form
the profile of a man's head. Score is based on time required.

11. Ship Test (originated by H. A. Knox). This is a picture of a ship cut
into ten sections, all of the same size and shape, to be inserted properly
in a rectangular frame. Score depends on quality of performance.

12. Healy Picture-Completion Test I. This is a large picture from which
ten small squares have been cut out. The missing parts are to be selected
from among forty-eight squares identical in size. Score depends on the
quality of completion within a limit of ten minutes.

13. Substitution Test. A page of rows of geometric figures (five different
shapes) has to be marked with appropriate digits, to correspond with a
key at top of page. Score is a combination of time and errors made.

14. Adaptation Board. This is a formboard having four circular blocks and
holes; three are 6.8 cm. in diameter, and the fourth is 7 cm. The subject
is shown that one block fits the larger hole. He is then required to keep
his attention fixed and to fit this larger block into the correct space
when the board is moved into four different positions. Score is based on
the number of correct moves.

15. Cube Test. Four cubes (one inch) are placed before the subject. They
are tapped in a specified order by the examiner with a fifth cube. The
subject is asked to imitate the order of tapping. The sequence becomes
longer and more complex. Score is the number of sequences correctly
imitated.

For general testing purposes, the authors of this performance scale ...
scale that includes ten of the fifteen ... 5, 9, 10, 11, 12, and 15 of the
preceding ...

The age range of the Pintner-Paterson scale is from 4 years to 15. ...
that every test in the series has ... For example, the Seguin Formboard is
not ... ordinarily, beyond age 10, while the Feature Profile Test is ...
generally useful below age 10.


THE CORNELL-COXE SCALE. For their scale, these authors selected tests
from a variety of sources. This scale includes seven types of test
materials, two of which (Manikin and Digit Symbol) are utilized in the
Pintner-Paterson and will, therefore, not be repeated here. The remaining
five types are the following. (Note that the Cornell-Coxe does not ...

1. Block Designs. These are the familiar Kohs colored-block designs, five
of which were included. They are scored for accuracy and time re...

THE ARTHUR POINT SCALE. This scale provides two forms. Form I is a
restandardization of eight tests in the Pintner-Paterson ... Knox Cube;
Seguin; Two-Figure ... Feature Profile; Mare and Foal; Healy Picture
Completion I. Two tests, the Porteus Maze and the Kohs Block Design, were
added.

The Porteus test consists of a series of mazes of increasing difficulty,
each printed on a separate sheet. The subject is required to trace, with
pencil, the course from entrance to exit. The Kohs test consists of the
same set of blocks used in the Wechsler-Bellevue scale, but the subject
reproduces different designs.


FIG. 12.2. Porteus Maze tests, years 5 and 14. C. H. Stoelting Company,
by permission.


Form II serves as an alternate when retesting is necessary. This version
utilizes four of the test types already described. These are Knox Cube,
Seguin Formboard, Porteus Mazes, and Healy Pictorial Completion II. The
only new type of material not thus far described is the Arthur Stencil
Design (see Fig. 12.3). This test employs twenty designs, presented
singly, that are increasingly complex and more difficult to reproduce. The
testee is given six square, colored cards and twelve colored stencils that
are cut within square cards. Each design is to be reproduced by placing
the appropriate cards and stencils one upon another, so as to duplicate
the original in both form and color. For example, a practice design
requires merely that a red octagonal stencil be laid over a white card to
get the desired result.
FORMBOARDS. Several performance tests serve a specific and limited
purpose by using formboards only. The Ferguson Formboards (1920; re...
ment ...), teachers' estimates of ... (... .56); and they were largely
utilized, apparently, with individuals who had come to a school guidance
clinic for assistance.

The Kent-Shakow Formboard Series (1952) is probably the most widely
known. Since the origi... would measure visual analysis and synthesis ...
provides a means of observing the subject's modes of dealing with a
problem.

The Carl Hollow-Square Scale, intended for use primarily with adults, is
of interest because the problems of analysis and organization it presents
are quite complex for this type of test; that is, more than the usual
number of variables are involved in each situation. The blocks are of
varying sizes and forms, having straight and beveled edges; and they are
truncated in several ways. The series of twenty tasks becomes
progressively more complex and difficult.

Since the subject must follow different and more complex instructions as
he progresses with the tasks in the scale, Carl believes that auditory
memory span is involved, as well as the mental operations necessary in
other performance tests. He believes that although this formboard test
measures mental operations involved in concrete and practical aspects of
activity, it is also a measure of general rather than special ability,
because of its significant correlations with verbal tests (r = .50 to
.80).


FIG. 12.4. Modified Kent-Shakow Formboard Series (Subtests A, B, C, and
D). By permission of William R. Grove.


4 In 1939, W. R. Grove made available a modified form of the Kent-Shakow
series, Industrial Model. See Figure 12.4.


Other Types


The Leiter International Performance Scale (1948) differs from others in
this category since, for the most part, it does not employ the usual types
of test materials. It is intended to be a nonverbal scale to measure
general intelligence. As such, it utilizes materials of the kind included
in: (1) nonverbal group scales (of the paper-and-pencil type); (2) forms
of materials included in verbal tests; and (3) a few of the conventional
performance tests. In the Leiter scale, however, all tests are presented
through a nonverbal medium. Items representing (1) above are, for example,
concealed cubes, matching pictures and forms, picture and form completion.
Under (2), above, are similarities, number series, and classification of
objects. Under (3) are included a number of ... matching of designs and
colors, completing block designs, ... completion (see Figs. 12.5 and
12.6).
F 


This series of tests is graded in difficulty, beginning at age 2 and
continuing through age 18. It is thus intended, presumably, for use with
adults as well as with children.5


FIG. 12.5. Selection of analogous designs, 9-year level. From The Leiter
International Performance Scale. By permission.


Studies of reliability have yielded quite satisfactory results, the
coefficients being in th... .40; performance scale, .79 ... a serious
question regarding ... comparability of mental functions required by these
two scales.

5 An adaptation for children has been prepared by Dr. Grace Arthur. It was
devised to measure ability of children from 3 to 8 years of age (4, p. 4).

One of the advantages often claimed for performance and other nonverbal
tests is that they are “culture free.” There are, however, no culture-free
tests. Examination of the content of the Leiter and other nonverbal scales
shows clearly that the specific materials included are derived from our
culture. The aim of these and other tests intended for the general
population can only be “culture fairness”; that is, to give no segment of
the population advantages over others.


FIG. 12.6. Classification by genus, 5-year level. From The Leiter
International Performance Scale. By permission.


The Goodenough Drawing Test (1926) is intended to evaluate a child's
intelligence by means of his drawing of a man. It is used with children
from the age of 3½ to 13½ years. The child is instructed to make a picture
of a man as best he can. He is told to work carefully and to take his
time. Scoring is based not upon esthetic quality but, rather, upon the
presence of essential details and the correctness of their relationships,
which presumably indicate the individual's level of perception and
analysis of a familiar object in his environment.

A revision of the Goodenough test is now being prepared for publication
(16, 17). This revision is based upon the same principles as the original;
but the standardization process was more thorough and, as a result, the
norms (for ages 5 to 15) should be more reliable. An innovation is this:
the child is asked to draw a picture of a woman and one of himself, in
addition to that of a man. Also, consistent with the current trend, point
scores on this test are converted to deviation IQs (mean = 100; SD = 15).
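
The conversion of a point score to a deviation IQ can be sketched as
follows. This is an illustration of the general principle only, not the
procedure of the revised Goodenough manual; the age-group mean and
standard deviation used in the example are hypothetical.

```python
def deviation_iq(point_score, age_mean, age_sd, iq_mean=100.0, iq_sd=15.0):
    """Express a point score as a deviation IQ.

    The score is first converted to a standard score (z) relative to the
    child's own age group, then rescaled to mean 100 and SD 15.
    """
    z = (point_score - age_mean) / age_sd
    return iq_mean + iq_sd * z

# Hypothetical norms for one age group: mean of 30 points, SD of 6 points.
print(deviation_iq(36, age_mean=30, age_sd=6))   # one SD above the age mean
print(deviation_iq(24, age_mean=30, age_sd=6))   # one SD below the age mean
```

Because the score is referred to the distribution of the child's own age
group, a given deviation IQ has the same meaning at every age, which is
not necessarily true of the older ratio IQ.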

Reliability coefficients of the Goodenough test vary from ... to .90,
depending in part upon the method used. Agreement between different
scorers of the same drawings is high, with correlations ... .90.

Validation studies, in terms of correlations with intelligence ..., have
yielded varying results. Coefficients, found principally with the
Stanford-Binet, are between .40 and .80. Some of these coefficients, of
course, ... Man Test, however, is a ... mental retardation is suspected.


The Columbia Mental Maturity Scale (1954; revised norms, 1959), designed
for the mental age range of 3 to 12 years, is intended primarily, though
not solely, to test the mental ability of children h... cerebral palsy or
other defects ... verbal functioning. ... other performance tests, this
scale may be used in checking on ratings of linguistically handicapped
children who ... using verbal materials. The child's respon... pointing,
or to pointing plus any verbal additi... make.

This scale consists of one hundred 6- by 19-inch cards presenting
problems graded in difficulty, each with a ... These tasks require what
Spearman ... “eduction of relations.” The student will find that this form
of ... quite simple to th...


FIG. 12.7. Test items from the Columbia Mental Maturity Scale. Reproduced
by permission.


This technically well-conceived scale demonstrates that with children it
is possible to obtain a reasonably dependable rating, when necessary,
through the use of a single type of material of a sound kind. This is not
to say, however, that under normal conditions a single-type scale should
be used in preference to one, like the S-B or the WISC, that employs
several types of test materials.


Functions Tested by Performance Scales 


This discussion will supplement the analysis that was presented in
connection with the nonverbal parts of the Stanford-Binet and the Wechsler
scales. The reader should refer to those for more detail.

Since all performance tests involve visual perception and manipulation of
objects, the number of types of items is relatively limited. It is not
surprising, therefore, to find that the range of psychological functions
is also restricted. This is one reason why the correlations between
performance scales and scales of the Stanford-Binet type are not higher
than they are, since the latter can sample a much wider range of mental
operations.

If the reader will re-examine the descriptions of the fifteen subtests in 
the Pintner-Paterson scale, and the few other types introduced in later 
scales, it will be readily apparent that, except for the Goodenough test, 
they may be classified in one of the following categories: 


geometric formboards, with variations, from the very simple to rather
complex.

picture formboards (also known as picture completion) of various degrees
of complexity.

block designs from simple to the complex.

recall of geometric designs.

picture arrangement.

block building.

cube sequences (imitating the order of tapping a series of cubes).

digit symbol.

mazes from very simple to the complex.

matching forms.


... perception, plus visual insight requiring analysis and synthesis.
Performance on all types is a measure also, in varying degrees, of motor
speed. Performance is facilitated by visual imagery, that is, by ability
to analyze and synthesize a pattern imaginally before actually going
through the movements. In all performance tests, visual-motor integration,
affecting the speed and accuracy with which a person responds, is a
factor. ... measures of general capacity, ... above average level, ...


Evaluation of Performance Tests


Validity. Although performance tests were originally devised to serve as
substitutes for verbal scales, comparative studies have shown that it is
sounder to regard them as supplements. The reason is that when the factor
of chronological age is held constant, in almost every instance the
coefficients of correlation fall at .50, or lower. Hence, although verbal
and performance tests measure some functions in common, or are in other
ways interrelated, each type also measures functions different from those
of the other.

The Pintner-Paterson scale, for example, yielded coefficients of .43 and
.23, respectively, when correlated with the S-B, for a group of dull
children and one of superior children (28). The r of .23 indicates that
these performance tests are inadequate for differentiating among levels of
superior individuals.

A fairly large number of studies have been published reporting higher
correlations between performance-test ratings obtained with the
Pintner-Paterson and similar tests and those obtained with revisions of
the Binet scale. Coefficients in the range of .70 and .80 were not
uncommon, whereas others were as low as .50. These coefficients, however,
cannot be interpreted as necessarily indicating considerable community of
function between these types of tests. It appears that to an appreciable
degree the correlation coefficients are the result of the wide age range
of the subjects tested, with the result that the coefficients reflect the
fact that the psychological functions being tested by both types increase
with age. An ordinary group of 10-year-old children will get higher scores
on both types of tests than will a similar group of 9-year-olds, who, in
turn, will score higher on both than an ordinary group of 8-year-olds; and
so on. This is to be expected; for the tests have been so constructed as
to yield progressive increases in age norms as chronological age
increases.

Another example of the effect of age range on correlations is found in
the correlations between intelligence ratings and height, or weight, or
dentition. For a wide age range, these are in the neighborhood of .50 and
.60, because older children generally are taller, heavier, and have more
permanent teeth. But within a single age group the correlation
coefficients between these physical traits and intelligence-test ratings
drop to negligible levels. Similarly, when age is held constant, or very
nearly so, the correlation coefficients between results obtained with the
Pintner-Paterson and similar performance tests, on the one hand, and
verbal tests, on the other, drop to between .40 and .60.
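
The effect of a wide age range on a correlation coefficient can be
demonstrated with a small simulation. The sketch below is purely
illustrative and uses invented data: two scores are generated that have
nothing in common except that both increase with age (an extreme case,
chosen to make the point plainly). The age range, group size, and amount
of random variation are assumptions, and all names are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(ages=range(5, 16), per_age=300, noise_sd=2.0, seed=0):
    """Return (age, verbal, performance) rows; the scores share only age."""
    rng = random.Random(seed)
    rows = []
    for age in ages:
        for _ in range(per_age):
            verbal = age + rng.gauss(0.0, noise_sd)
            performance = age + rng.gauss(0.0, noise_sd)
            rows.append((age, verbal, performance))
    return rows

rows = simulate()

# Correlation over the whole age range of 5 to 15 years.
pooled = pearson_r([v for _, v, _ in rows], [p for _, _, p in rows])

# Correlation computed separately within each single-year age group.
within = {
    age: pearson_r([v for a, v, _ in rows if a == age],
                   [p for a, _, p in rows if a == age])
    for age in range(5, 16)
}

print("pooled r:", round(pooled, 2))
print("largest within-age |r|:",
      round(max(abs(r) for r in within.values()), 2))
```

In this artificial case the pooled coefficient is substantial although
the coefficient within any single age group is near zero. Real verbal and
performance scores do share functions, so their within-age coefficients
remain well above zero, as the figures cited above show.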
Cornell and Coxe, who took the position that a performance scale should
be supplemental to the verbal type, report a correlation of .79 with the
S-B, for a wide age range. Yet when CA was held constant, the partial
correlation coefficient dropped to .38. Intercorrelations of the parts of
their scale varied from .50 to .75, over a wide age range; but, again, with
CA constant, the coefficients dropped to .20-.60.
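
Holding chronological age constant is ordinarily done with the
first-order partial correlation formula. A minimal sketch follows. The
correlations of each scale with age are not reported in the text; the
value of .81 used for both is an assumption chosen only to show how a
coefficient of .79 can shrink to the neighborhood of .38.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial r)."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator

r_perf_sb = 0.79    # performance scale with S-B, wide age range (from the text)
r_perf_age = 0.81   # assumed: performance score with chronological age
r_sb_age = 0.81     # assumed: S-B score with chronological age

print(round(partial_correlation(r_perf_sb, r_perf_age, r_sb_age), 2))
```

The higher the correlation of each test with age, the more of the
original coefficient is attributable to age alone, and the lower the
partial coefficient becomes.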

Since the Arthur scale was devised as a substitute for the Binet
revisions, in cases where a verbal scale is inappropriate, it is pertinent
to examine the correlations found between the two scales. Table 12.1
shows ...


TABLE 12.1

CORRELATIONS OF STANFORD-BINET IQs AND ARTHUR IQs

Age    N     r
 5     35    .70 ± .06
 6     54    .77 ± .04
 7     50    .68 ± .05
 8     44    .74 ± .05
 9     41    .80 ± .04
10     40    .51 ± .08
11     44    .68 ± .05
12     31    .80 ± .04
13     27    .21 ± .12
14     27    .07 ± .18
15     16   -.10 ± .17
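
The "±" values attached to the coefficients in Table 12.1 appear to be
probable errors, the index of sampling error in common use when these
data were published. The sketch below assumes the classical formula
PE = .6745(1 - r^2) / sqrt(N); that assumption is not stated in the text,
but the formula reproduces the tabled values for the rows checked here.

```python
import math

def probable_error_of_r(r, n):
    """Classical probable error of a correlation coefficient."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

# (age, N, r) for three rows of Table 12.1.
rows = [(5, 35, 0.70), (6, 54, 0.77), (10, 40, 0.51)]

for age, n, r in rows:
    print(age, n, r, round(probable_error_of_r(r, n), 2))
```

Note that the probable error grows as N shrinks and as r approaches zero,
which is why the coefficients for the oldest and smallest groups in the
table are the least dependable.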


... ages of 5 and 12 years, ... however, are so low that th... the age
range of 5 to 12 as a supplement to verbal scales of the Stanford-Binet
type.

This conclusion is further supported by validating data obtained sub- 
sequent to the publication of Arthur's manual. These later studies, using 
both the 1916 and 1937 revisions of the Stanford-Binet, can be sum- 
marized as follows: 


Stanford-Binet and Arthur IQs correlate variously, from about .50 to
about .80.

In a large majority of cases, the Arthur scale IQs tend to be somewhat
higher than the S-B at levels below 90 IQ.6

At the levels above 90 IQ, the S-B tends to yield somewhat higher ratings.

The means of the differences between the IQs of the two scales have been
found to range from about 5 to 10 points.


Dr. Arthur's position appears to be that the appreciable extent of agree- 
ment between S-B ratings and those of her performance scale indicates 
rather even development and manifestation of psychological functions in 
general. She believes that if the results of these two scales disagree sig- 
nificantly in the case of a given individual, this is due to unevenness in 
development and expression of functions, or to some complicating non- 
intellectual factors. 

In the case of any individual, the actual interpretation of performance- 
test results, taken in conjunction with verbal-test findings, will depend 
upon all information available with regard to the person concerned and 
upon the interrelations of all relevant facts and data. If the two scales used 
Bive discrepant results, the psychologist’s task is to discover the reasons 
for whatever significant differences are found. His analysis and inter- 
pretation of test results are thereby enriched and have greater validity. 

The correlation coefficients cited above for the three performance scales 
are representative of those generally found with this type of testing ma- 
terial. A factorial analysis of results obtained with thirty-four commonly 
"sed performance tests suggests one reason for the low or only moderate 
Correlations between these and verbal tests (29). This analysis indicates 
that the principal factors measured by the performance tests studied may 
be identified as perceptual speed, spatial, and induction. Perceptual speed 
is defined as the readiness to discover and identify perceptual detail 
(mainly visual). The spatial factor is the ability to manipulate objects in 
Space. Induction, of course, means reasoning from the particular to the 
general. Although the first two of these factors are involved to some ex- 
tent in many verbal tests of mental ability, they are of relatively minor 
significance in the determination of an individual's rating on these. 

9 Dr. Arthur reported that for 435 clinic cases who had S-B intelligence quotients of 
less than 95, there was no group trend in the direction of higher ratings on her scale 
(1, revised, p. 14; also 3). 

Reliability. So far as reliability is concerned, performance 
tests stand up reasonably well individually, although, as a group, not as 
well as the better verbal scales. The Cornell-Coxe scale, for example, re- 
ports coefficients for each of the parts varying from .66 to .89, while for 
the total score the reliability was .929. Forms I and II of the Arthur scales 
were correlated at each age level from 6 to 16 years. The coefficients, with 
CA constant, varied from .55 (PE = .06) at age 8, to .70 (PE = .06) at 10 
and 15. The median coefficient was .61 (PE = .06). These low coefficients 
are possibly evidence of the nonequivalence of the two forms rather than 
of the unreliability of either. In another study, 61 institutionalized men- 
tally deficient boys (mean Stanford-Binet IQ = 67) were tested with the 
Arthur scale and retested after two years. The reliability coefficient was 
.85 for total scores, and .69 to .80 for part scores. However, the mean 


loss of only one point on the S-B during the same interval (30). 

The mean gain of ten points may be attributed to one, or both, of 
ype of test are more susceptible 
ose on the verbal type; or (2) residence 


Paterson and other per- 
used for clini 1 oses. 
ons of ther; and experimental purp > 
of them have been made. They are mor 
Performance tests, as already stated, are limited in range of mental 
functioning tested; because of this, they do not differentiate well among 
above-average individuals. These performance scales are most useful, 
on the whole, at the lower age levels and the lower mental levels, as well 
as for testing persons who have language handicaps. 

Geometric formboards, picture-completion boards, object assemblies, 
etc., are within almost universal experience of American school children; 
and since these tests make demands upon the mental processes already 
indicated, they may be regarded, in some degree, as measures of intel- 
ligence at these earlier age levels. However, as performance tests go higher 
in age and difficulty levels, they present specialized problem situations 
which the subjects have not had comparable backgrounds for handling. 
This is especially true of the more complex and subtle formboards (for 
example, the Carl hollow square), performance on which is facilitated by 
training and experience in tasks requiring spatial perception, as in some 
types of engineering, cabinetmaking, and the like. Since these conven- 
tional performance tests do not require much use of the ability to make 
abstractions and to deal with concepts, they fail to measure some of 
the most important aspects of mental activity. 

Among the advantages reported for performance tests are these: (1) 
Since the tests do not require the use of language, individuals do not 
"block" as a result of feelings of inadequacy resulting from lack of formal 
schooling; (2) since all elements of the problem are visually present, some 
individuals proceed with greater confidence. 

Psychologists are agreed that, where indicated, the use of performance 
scales can provide more information than only a rating in the form of a 
numerical index. These tests provide an opportunity to observe qualita- 
tive aspects of behavior under standardized conditions in a variety of 
problem situations. A subject's approach to a problem might reveal, for 
example, a state of depression or agitation; hesitation or impetuousness; 
thoughtful deliberateness, bull-headed persistence, or easy discourage- 
ment; an insightful approach or one of haphazard trial and error. 

The reader will have noted that not all authors of performance tests 
agree as to what the characteristics of such a scale should be. The Arthur 
scale was constructed to provide a nonverbal substitute for the Binet 
revisions. Hence it was expected to have a highly significant correlation 
with these revisions, and standardization proceeded on that principle. 
The Cornell-Coxe scale is intended not as a substitute for Binet revisions 
but as a supplement to them. Thus, this performance scale was con- 
structed on the principle that there should be a relatively low correlation 
between it and the verbal type of intelligence tests. The Leiter scale, on 
the other hand, differs from these and other conventional performance 
tests in that, for the purpose of measuring much the same mental proc- 
esses as do the verbal tests, it employs a rather distinctive technique with 
a variety of materials, some of which have been adapted from nonverbal 
group tests of the paper-and-pencil type, while others have been specially 
devised. 

Experimental evidence indicates that the usual performance scales (for 
example, Pintner-Paterson) are best used as supplements to verbal scales. 
The former are instruments with which we may test development of in- 
sightful behavior involving visual perception, instead of by the use of 
symbols (language and number) that are essential for abstractions, con- 
cept formation, ideational reasoning, and ability to deal with problems 
extending beyond one's immediate, concrete environment. 
13. 


SCALES FOR INFANTS AND 
PRESCHOOL CHILDREN 


In this chapter we shall present several representative scales 
devised to evaluate mental development of children ranging in age from 
one month to six years. Some of these scales, for the most part, are not 
tests as that term is commonly understood. They are, rather, norms and 
inventories of development and behavior, grouped at their respective 
average age levels, derived from observation of children's behavior and 
from experimentation in a variety of situations. All are administered 
individually. The content of these scales is presented in some detail, so 
that the student may become acquainted with the scales’ characteristics in 
order to make meaningful comparisons with tests used at later ages. 


The Gesell Scales 


The Gesell scales provide tests at two levels; one is The Infant 
Schedule and the other is The Preschool Schedule. They are the products 
of systematic study of infants and preschool children at the Yale Clinic of 
Child Development. The first schedule (1925) provided rather crude 
norms at the following age levels: 4, 6, 9, 12, 18, 24, 36, 48, and 60 
months (9). 

At each level the inventory was divided into four categories of be- 
havior: (1) motor, (2) adaptive, (3) language, (4) personal-social. Although 
the normative schedules themselves have undergone considerable re- 
vision and refinement since their first appearance, these four categories, 
with some minor variations in terminology and analysis at times, have 
remained throughout. Motor behavior is said to be of value “. . . be- 
cause it has so many neurological implications, and because motor capac- 
ities of the child constitute the natural starting point for an estimate of 
his maturity.” In adaptive behavior “. . . we reckon with the finer 
sensori-motor adjustments to objects and situations: the coordination of 
eyes and hands in reaching and manipulation; . . . the capacity to ini- 
tiate new adjustments in the presence of simple problem situations which 
we set before the infant.” Language behavior, broadly used, includes 
“. . . all visible and audible forms of communication, whether by facial 
expression, gesture, postural movements, vocalizations, words, phrases, or 
sentences. [It] includes mimicry and comprehension of the communica- 
tions of others.” Personal-social behavior “comprises the child's personal 
reactions to the social culture in which he lives [bladder and bowel control, 
feeding abilities, sense of property, self-dependence in play, cooperative- 
d social conventions]" (11, p. 5) 

d for the examination of children be- 
- At the 4-week level, the inventory of 
d control, arm-hand posture, leg-foot 


A : 1 
y posture and Progression, regard, prehension, language anc 


social behavior, At the 56-week le 
and progression, prehension, 


age, they have been des- 
» if at ascending ages there is a 
ants showing that behavior; (2) 


percentage giving that response. 
vior items were allocated to age 
ncy. The "focal" behavior items 
€ most frequently observed? 

lus or minus, depending upon 
merated behaviors. The score on 
TABLE 13.1 


DANGLING RING BEHAVIOR (4 WEEKS-28 WEEKS) 
RD Behavior items | 4 | 6 | 8 | 12 | 16 | 20 | 24 | 28 
l Regards after delay 77 | 54| 64|6527|13 | 14 5 
? Regards immediately 96 | 46 | 36 |35 |68|97| 96| 95 
3 Regards momentarily 53 |, 85 | WE) 88) $5 1 — = 
4 Regards prolongedly 47| 43 | 29 | 62 | 87| 47| 38 5 
5 Regards consistently —| —| —|—] 17] 26} 59] 90 
6 Disregards in midplane 771 89 | 46 | 46 | 14 |i— |) a = 
7 Regards in midplane 99 | 61] 54|54|86| —| —| = 
8 Regards in midplane 

(long head) 22 | :25 12. |.50: |. 88. |: —)) =). 

9 Regards in midplane 

(round head) 32 | 75]| 70 | 56°) 88 |=), hi — 
10 Regards ring in hand =| —|-—]-— | 56,8300) 300 
ll Regards string SN ees een 23 
12 Shifts regard 94 | 100 | 100 | 96 | 93 | 46 38 | 41 
13 Shifts regard to surroundings 75| 68| 61|35 13|16| 14) 5 
14 Shifts regard to Examiner's 

Hand os | e¢| 6€1|77/48|— —| — 
15 Shifts regard to Examiner 41 | 54| 57| 65 | 64) 27) 24) 27 
16 Shifts regard to hand 0 4 Te Rd ELA S Sit 
17 Follows past midplane 44 | 62| 50) 58) 84)—| — | — 
18 Follows past midplane (lg. h.) 20; |.88; | 250/037 83. |. esa): s 

19 Follows past midplane (rd. h.) 55. | 753 eg 677] S27a| catal ma 

3 Follows approximately 180° 16: 2497 [hy.40'1| M50. 6S | eat ae 
than op , i 0° 

BR approximately 18 ol n| s5|2/8|—| —| — 

22 p " o 

ie approximately 180 se se | va [erem | [e] 

?3 Approaches 0 fu M 4 z * = 

24 Approaches after delay 3 pozo A lok z e| sı| 91 

25 Approaches promptly Fel T E "m GA iiu pee 

de Arms increase activity " , pA 15 |17 | 19 A 

Arms separate 

28 E with one hand 0 0 = b * » E g 

29 Approaches with both hands on SOL END i: 44 60] 54| 14 

7 Approaches with arms flexed 4 ; t 8/20|38| 1 5 

Hands come together 

32 Contacts ring s 4 i 1 s. 2: E: P E. 

2 C ea ring on contact 1 ò ol 8|22|73| 96] 100 

is c eee ee S RE ecu a 


Grasps after delay if grasps 
TABLE 13.1 (Continued) 


RD Behavior items | 4 | 6 | 8 | 12 | 16 | 20 | 24 | 28 
—|— |61| 45 
36 Grasps interdigitally | E so|18| aol Bs 
37 Retains entire period 7— 3 zal EL spill sella | Wee 
38 Holds with both hands =| =| —|— PAIE In 14 
39 Hand opens and closes on nng |—]| —| —| — n | gal agi va 
40 Brings ring to mouth ee CLA em ss rl ze da 
41 Free hand to midplane —|—|—|— : ri di 74 
A Ha T 78 | 56 s C 
44 Drops immediately SI m x i alate 
45 Regards dropped ring if drops -—] S a S1 ie | ee 1 200 
46 (If drops) pursues droppedring | | —| _ | _ 7 
47 (If drops) resecures dropped FEES: 
ring A ees em 35 
48 Rolls to side 3 4 8| 4|(35|42]| 88 
49 Frets 


» 21, 24, 30, 36, 42, 48 


cane ER month the schedules at the two 


extremes are given, 
15-Month Level- 


Steps, starts and Stops 
ollapse 
* has discarded creeping 
Stairs: creeps up full flight 
Cubes: tower of two 
Pellet: placed in bottle 
Book: helps turn pages 

Adaptive: 
Cubes: tower of two 
Cup and Cubes: six in and out cup 
Drawing: incipient imitation stroke 
Formboard: places round block 
Formboard: adapts round block promptly 


Language: 
Vocabulary: four-six words or names 
Jargon: uses 
Book: pats picture 
Picture card: points to dog or own shoe 


Personal-Social: 
Feeding: has discarded bottle 
Feeding: inhibits grasp of dish on tray 
Toilet: partial toilet regulation 
Toilet: bowel control 
Toilet: indicates wet pants 
Communication: says "ta-ta" or equivalent 
Communication: indicates wants (points or vocalizes) 
Play: shows or offers toy to mother or examiner 
Play: casts objects playfully or in refusal 

72-Month Level 
Motor: 
Jumps from height of 12", landing on toes only 
Advanced throwing 
Stands on each foot alternately, eyes closed 
Walks length of 4-cm. board 
Copies diamond 

Adaptive: 
Builds three steps with cubes 
Draws man with neck, hands on arms, and clothes 
Draws man with two-dimensional legs 
Copies diamond 
Adds nine parts to incomplete man 
Discriminates five weights, no error 
Detects missing parts of pictures 
Repeats four digits 
Gives correct number of fingers on single hand and on both 
Adds and subtracts within five 

Language: 
Binet items are used here 

Personal-Social: 
Ties shoe laces 
Differentiates A.M. and P.M. 
Knows right and left or complete reversal 
Recites numbers up to the thirties 
These schedules are 
guide intended for use i 
in respect to the four d 
EVALUATION. 


Schedule as the basis of validity. “Fundamentally the 
hedule here offered depends on the validity of the norms, 


cept of maturity level, and the justness of using a sample of the child's 
behavior to ind 


í - - Our conclusions regarding them [the 
foregoing issues] go beyond experimental data and are based on years of 
j in claiming their general soundness 
ary evidence is revealed” (12, p. 218). 
maintain that reliability cannot be 
ethods presented in connection with 
ompson state, “It is the systematic 


s” (12, p. 219). And they believe that 
Passing and failing various behavior 
tudies, are reliable to a satisfactory 

In discussing the validity of the 
collaborators do not 
ing at the several a 


ge levels, They state, however: “Application of the 
Schedules is 4 simple matter of determining how well a child’s behavior 
ts one age leye] constellation rather tha 
direct co; ni 


bout it. It amounts to matching, 
P" (10, P- 320). Performance 
categories of behavior in terms of the four approximate age levels.3 

Inspection of the Infant and Preschool Schedules shows that each 
evaluates a combination of some aspects of mental development (as usu- 
ally understood), of motor development, sensory development and per- 
ception, and of personal habits (often called social development). 

The schedule for infants (4 weeks to 56 weeks of age) has value for 
the experienced psychologist, since it provides experimentally derived 
means for estimating, nonquantitatively, specific aspects of a child's de- 
velopment within the first year of life. But as a group of psychological 
tests, this schedule does not satisfy the demands of standardization in 
terms of norms, reliability, and validity. The population sample was 
small (forty-nine boys and fifty-eight girls) and restricted (from a homo- 
geneous middle-class background). On the positive side, it can be said 
that superior experimental techniques were used, and careful observa- 
tion, experience, and behavioral insight went into the derivation of the 
schedule. For these reasons, it is useful, when applied by skilled observers, 
in appraising an infant's developmental status as it appears at the time 
of the examination. 

For the same reasons stated above, the Preschool Schedule is of ques- 
tionable value. For this group of children, especially those 2 years of 
age and older, there are other scales that have been standardized and can 
be used with more confidence. With some of these, the reader is already 
familiar (for example, the Stanford-Binet); others will be described in 


the following pages. 


Cattell Developmental and Intelligence Scale 


The Cattell scale, of superior merit, covers the range from 2 
to 30 months. Its test items are adaptations of many that were developed 
and included in earlier tests, notably those of Gesell and his associates. 
The scale “has been so constructed as to constitute [...] Form L of the 
Stanford-Binet tests. Between the ages of twenty-two and thirty months 
Stanford-Binet items are intermingled with other items. Thus, using the 
infant test items at the early months and the Stanford-Binet tests for the 
older ages with a mixture of the two between, one continuous scale from 
early infancy to maturity has been attained” (6, p. 24). The test items are 
grouped at age levels, as they are in the Stanford-Binet. Groupings are at 
each month from 2 through 12; at two-month intervals in the second year; 
and at 27 and 30 months. The following three age levels illustrate the 
nature of the items and their arrangement. 

3 [...] validity and reliability of [...] should not be substituted for external 
criteria [...] Gesell and his collaborators were applying what, at present, 
would probably be called “construct validity.” 


2 Months 


1. Attends voice 

2. Inspects environment 

3. Follows ring in horizontal motion 
4. Follows moving person 

5. Babbles 

Alt. a. Follows ring in vertical motion 
Alt. b. Lifts head in prone position 


10 Months 


1. Uncovers toy 

2. Combines cup and cube 

3. Attempts to take third cube 

4. Hits cup with spoon 

5. Pokes finger in holes of peg board 
Alt. Picks up spoon before cup 


30 Months 


1. Differentiates bridge from tower 

2. Imitates drawing lines and circles 

3. Stanford-Binet three-hole formboard rotated 
4. Folds paper 

5. Stanford-Binet identifying objects by use 
Alt. a. Identifies picture from name 

Alt. b. Concept of one 


The scale was standardized by longitudinal testing; 1946 examinations 
were made on 274 children at the ages of 3, 6, 9, 12, 18, 24, 30, and 
36 months.4 

In the process of standardization, it was Cattell’s purpose, among other 
things, to improve on earlier scales by (1) improving objective procedures 
for administering and scoring; (2) eliminating items of the personal-social 
category, which are markedly influenced by home training; (3) eliminating 
items that are indicators of large motor control; (4) providing more accu- 
rate age scaling; (5) providing an adequate age range, so that continuity 
of development can be studied; (6) providing a more nearly equal dis- 
tribution of items over the age range covered. 

4 Although the scale was standardized only on these age levels, groups of items are 
provided at certain age levels between them. The placement of items between the 
standardization levels was estimated. Dr. Cattell states, however, that the indications 
are that the scale may be used with only a little less accuracy with children between 
the standardization ages. At the same time, she urges the exercise of caution in 
interpreting test results at ages between the standardization levels. 
Fig. 13.1. Regards cube. Age 3 months. 

Material: A 1-inch cube painted bright red. 

Procedure: As the child is sitting in an upright position before the table, the 
cube is placed on the table within easy view of him. The cube may be tapped on 
the table or moved about to attract the child's attention. 

Scoring: Credit is given if the child observes the cube. His eyes must remain 
on or return to the cube after the examiner has removed his hand. In other 
words, the examiner must make sure that it is the cube and not his hand which 
is observed. 

From P. Cattell, The Measurement of Intelligence of Infants and Young Chil- 
dren, Psychological Corporation, by permission. 


The method of scoring is the same as that used with the Stanford- 
Binet. Each item is rated as either plus or minus, no partial credits being 
given. Since there are five test items at each age level, the credit assigned 
to each is one fifth of the interval covered by the particular series of 
tests. Thus, in a series spanning a one-month interval, each item carries 
a credit of two tenths of a month; when the interval is two months, the 
credit is four tenths; with a three-month interval, it is six tenths. Like 
the Stanford-Binet, the Cattell scale uses a basal age, adds the credits at 
higher levels to obtain a mental age, and from that, a ratio IQ. 


Fig. 13.2. Picks up spoon. Age 5 months. 

Material: Teaspoon. 

Procedure: The spoon is placed directly in front of the child (sitting posi- 
tion) within easy reach. 

Scoring: Credit is given if the child makes a definite effort to reach for and 
pick up the spoon and succeeds, but if the spoon is picked up by reflex closure 
of the hand on chance contact, it is not credited. Accurate reaching, however, is 
not required. 

From P. Cattell, The Measurement of Intelligence of Infants and Young Chil- 
dren, Psychological Corporation, by permission. 


Fig. 13.3. Places round block in formboard. Age 16 months. 

Material: The formboard is similar to Gesell's: It is made of a three-eighths- 
inch board 36 x 16 cm., stained dark green. Three holes are cut in the board 
equidistant from each other and from the edges. From left to right the holes 
are a circle 8.7 cm. in diameter; an equilateral triangle, with sides 9.3 cm., and 
a square with sides 7.5 cm. The inserts are made of wood 2 cm. thick and 
painted white. The circle is 8.5 cm. in diameter, the sides of the triangle 9 cm., 
and those of the square 7.3 cm. 

Procedure: The formboard is placed before the child with the circle on his 
left and the base of the triangle toward him. The circle is placed in its recess 
and the child is allowed to take it out, then he is asked (with appropriate ges- 
tures) to “Now put it back.” 

Scoring: Credit is given if the child replaces the round block. If it is done with 
an evidently purposeful act, one trial is enough, but if there is some doubt as to 
whether or not it was a chance replacement, no credit should be given unless it is 
placed a second time. (Credit is given for replacing the round block in the re- 
versed board at eighteen months.) 

From P. Cattell, The Measurement of Intelligence of Infants and Young Chil- 
dren, Psychological Corporation, by permission. 


VALIDITY AND RELIABILITY. Although the presence of significant in- 
crease in percent passing each item at successive ages was used as signif- 
icant evidence of validity, the principal criterion was the correlation be- 
tween Cattell IQs, obtained to the age of 30 months, and Stanford-Binet 
IQs of the same children at the age of 36 months. These coefficients are 
shown in Table 13.2. 
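
The scoring rule described above for the Cattell scale (five items per level, each worth one fifth of the interval, added to a basal age and converted to a ratio IQ) can be illustrated with a short sketch. The function names and the sample child are hypothetical and are not taken from Cattell's manual; the sketch assumes that the basal age and the number of items passed at each higher level have already been determined by the examiner.

```python
ITEMS_PER_LEVEL = 5  # the text states five test items at each age level


def item_credit(interval_months: float) -> float:
    """Credit in months for one item: one fifth of the interval covered."""
    return interval_months / ITEMS_PER_LEVEL


def mental_age(basal_age_months: float, passes_above_basal) -> float:
    """Basal age plus credits earned at higher levels.

    passes_above_basal: iterable of (interval_months, items_passed) pairs,
    one pair for each age level above the basal age.
    """
    total = float(basal_age_months)
    for interval_months, items_passed in passes_above_basal:
        if not 0 <= items_passed <= ITEMS_PER_LEVEL:
            raise ValueError("items_passed must lie between 0 and 5")
        total += items_passed * item_credit(interval_months)
    return total


def ratio_iq(mental_age_months: float, chronological_age_months: float) -> float:
    """Ratio IQ: 100 x mental age / chronological age."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# Hypothetical child, 12 months old, basal age 10 months:
# 3 passes at the 11-month level, 2 at the 12-month level (one-month
# intervals), and 1 pass at the 14-month level (two-month interval).
ma = mental_age(10, [(1, 3), (1, 2), (2, 1)])
print(round(ma, 1), round(ratio_iq(ma, 12)))
```

In this invented case the credits are .6, .4, and .4 of a month, so the mental age is 11.4 months and the ratio IQ is 95.
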


TABLE 13.2 


VALIDITY COEFFICIENTS: CATTELL AND STANFORD-BINET SCALES 


No.    Ages at examination      Coefficients 
42      3 mos. and 36 mos.       .10 ± .10 
49      6 mos. and 36 mos.       .34 ± .08 
44      9 mos. and 36 mos.       .18 ± .10 
57     12 mos. and 36 mos.       .56 ± .06 
52     18 mos. and 36 mos.       .67 ± .05 
52     24 mos. and 36 mos.       .71 ± .05 
42     30 mos. and 36 mos.       .83 ± .03 


Source: Cattell (6, p. 49). By permission. 


If we accept the Stanford-Binet as the criterion, it appears that the 
predictive coefficients are negligible for tests given during the first 9 
months of life. In this respect they are much the same as other current 
scales. For the later ages, up to 30 months, the coefficients increase appre- 
ciably, and are on the whole superior to those found with most other 
scales designed for these age levels. 
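
The figure following each coefficient in Table 13.2 appears to be a probable error (PE), the notation used explicitly for the Arthur scale coefficients in the preceding chapter; that reading is an assumption here, not something the table states. Under that assumption, the classical formula PE = .6745(1 - r²)/√N can be checked against a few rows of the table with a minimal sketch (the function name is illustrative):

```python
import math


def probable_error_r(r: float, n: int) -> float:
    """Classical probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# (number of cases, coefficient) pairs as read from Table 13.2
rows = [(42, 0.10), (57, 0.56), (52, 0.67)]
for n, r in rows:
    print(f"N = {n:2d}   r = {r:.2f}   PE = {probable_error_r(r, n):.3f}")
```

For these three rows the computed values round to .10, .06, and .05, which agrees with the tabled figures. A coefficient is conventionally regarded as dependable only when it is several times its probable error, which is one way of seeing why the coefficients for the first 9 months are called negligible.
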
In spite of the low predictive value of the coefficients at the earlier 
ages, Cattell has found, from study of individual cases, that the tests may 
be of considerable assistance to the clinician in appraising infants who 
are marked deviants from the norm. This is the case especially with in- 
fants who get a high quotient; for, Cattell reports, they have markedly 
better than average chances of earning a high rating at the age of 2 or 
3 years. 

Reliability of the scale was calculated by the odd-even method and 
corrected by the Spearman-Brown formula. Coefficients ranged from a 
low of .56 ± .05 at the age of 3 months, to a high of .90 ± .01 at 18 
months. The median coefficient was .86 ± .03. These coefficients compare 
favorably with those found for other scales. 
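
The odd-even method with the Spearman-Brown correction, mentioned above in connection with the Cattell scale, can be sketched as follows. The item scores are invented for illustration and the function names are hypothetical; the sketch assumes each child's record is a list of plus (1) and minus (0) ratings in item order.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 cases")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on one half do not vary")
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Estimated reliability of the full test from the half-test correlation."""
    if r_half <= -1.0:
        raise ValueError("r_half must be greater than -1")
    return 2.0 * r_half / (1.0 + r_half)


def odd_even_reliability(item_scores) -> float:
    """Correlate odd-item and even-item totals, then apply Spearman-Brown."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Invented plus/minus records for five children on six items
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(round(odd_even_reliability(scores), 2))
```

The correction is needed because the correlation between two half-tests describes a test only half as long as the one actually given; a half-test correlation of .75, for example, is stepped up to about .86 for the full scale.
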


Minnesota Preschool Scale 


The Minnesota scale, in two forms, is an adaptation and re- 
standardization of test items chosen from the earlier work of a number 
of psychologists, plus some original additions. It is designed for use with 
children from 18 months to 6 years. 

The scale includes the following twenty-six tests: pointing out parts 
of the body; pointing out parts in pictures; naming familiar objects; 
copying a circle, triangle, and diamond; imitative drawing (vertical and 
horizontal strokes and a vertical cross); block building; response to pic- 
tures; Knox cube imitation (tapping a series of cubes in a given order); 
obeying simple commands; comprehension (“What must you do when 
you are hungry?”); discrimination of geometric forms; naming objects 
from memory; recognition of forms; color naming; tracing a form; picture 
puzzles (object assembly); incomplete pictures; digit span; picture puzzles, 
diagonal series (more difficult object assembly); paper folding; absurdities 
(verbal); mutilated pictures; vocabulary; word opposites; imitating posi- 
tion of clock hands; speech (length of sentence spoken by child during 
examination). 

The norms are so arranged that it is possible to obtain three separate 
scores for children above 30 months of age: a verbal, a nonverbal, and a 
total score. For a child under 30 months of age only the total score is 
used because the authors of the scale were unable to work out a system 
of differentiated scoring for these earlier levels. A rough analysis is pos- 
sible, however, to determine whether a pronounced difference between 
verbal and nonverbal responses exists. If such a difference is found at 
any age within the range of the scale, then, as the case may be, handicap 
or acceleration, in respect to language or perceptual-motor ability, may 
be inferred. 

VALIDITY AND RELIABILITY. The manual of the Minnesota scale does 
not provide data specifically designated as evidence of validity. We may 
infer, however, that the authors regarded the following facts as their 
basis of validity: (1) the adaptation and use of types of test items con- 
sidered by many psychologists, over a period of years, to have value; 
(2) a standardization group of 900 children, ranging in age from 18 
months to 6 years (100 in each of nine half-year age groups), who were 
balanced equally as to sex and whose fathers were representative of the 
distribution of occupational levels in the general population. 

In a later publication, data relevant to the scale's validity are avail- 
able (15). Children who had been tested originally with the Minnesota 
scale at various ages during their preschool years were retested with the 
1916 Stanford-Binet in some instances, or with the 1937 revision in 
others. The intervals between tests and retests varied from a few months 
to about ten years. 

When the 1916 revision was used in retesting and the results were 
correlated with the total scores of the Minnesota, the range of coefficients 
for the various groups was from a low of .25 to a high of .75. 

When the 1937 Stanford-Binet was used in retesting, the correlations 
with original Minnesota scale total scores yielded coefficients from .15 
to .76. 

Table 13.3 shows the median correlations between original Minnesota 
IQ equivalents, found at various ages, and the retest Stanford-Binet IQs 
at ages ranging from 4½ to 13½ years. 


TABLE 13.3 


CORRELATIONS BETWEEN MINNESOTA PRESCHOOL AND 
STANFORD-BINET IQs 


Age in months at            Correlations 
taking Minnesota*       1916 S-B    1937 S-B 
Under 36                  .45         .21 
36-47                     .64         .61 
48 and over               .65         .68 


* The number of cases in each group was large, ranging from 141 to 841. 

Source: Goodenough and Maurer (15, Part II). 
If the S-B scales are accepted as significant criteria of validity, as they 
have been by many psychologists, then the conclusion follows that the 
Minnesota scale has low validity below the age of 36 months, but that 
it has much higher validity for children above that age. 

In this connection, two considerations are relevant. First, it has been 
found that all scales devised for use with children below the age of 18 
months show a low or moderate correlation with retest results in later 
childhood. The probable reasons for this fact will be presented later. 
Second, correlation coefficients between results of testing and retesting 
the same subjects with even the same scale tend to decrease somewhat as 
the time interval between examinations increases. 

Reliability data of the Minnesota scale are variable at different age 
levels. Coefficients between scores on the two forms, with intervals of one 
to seven days, were .68 to .94 for the verbal tests, .67 to .92 for the non- 
verbal tests, and .80 to .94 for the total scores. The average reliability 
coefficients for a single form, within an age range of 6 months, were 
.86 for the verbal, .82 for the nonverbal, and .89 for the total scores. 


The Merrill-Palmer Scale of Mental Tests 


Although the norms of this scale are based upon 631 cases rang- 
ing in age from 18 to 77 months, it is not recommended for use with 
children below 24 months, or above 63 months of age. 

The scale consists of ninety-three items arranged in order of difficulty. 
There is no attempt to group these according to types of function or 
behavior involved. The age norm for each item is given, this being the 
age at which fifty percent of the children were successful. Although there 
are ninety-three items, only thirty-eight are different items. Some (twenty- 
one) recur several times, at different age levels; at later ages a higher 
level of performance is required (in terms of quality or quantity of re- 
sponse, or in rate of activity) if credit is to be earned. Other items (seven- 
teen) occur only once. 

The scale includes some language (for example, “What runs? What 
scratches?” These are known as action-agent tests. There are also simple 
questions, such as “What does a doggie say?”); manipulation of the 
body (opposition of thumb and fingers, crossing feet); motor skills and 
coordination (throwing a ball, buttoning); visual insights (building with 
blocks, copying a circle and a cross, completing formboards and picture 
puzzles); and recognizing familiar objects and colors. 

Stutsman provides and suggests the use of a “Guide for Personality 
Observations” in connection with this scale. Although these observations 
do not affect the scoring, they are, nevertheless, useful to the clinician in 
interpreting a child's responses during the examination. The following 
traits are observed and rated during the testing period: self-reliance, self- 
criticism, irritability toward failure, degree of praise needed for effective 
work, initiative and independence of action, self-consciousness, spon- 
taneity and repression, imaginative tendencies, reaction type (slow and 
deliberate, calm and alert, quick and impetuous), speech development, 
dependence on parent, and others. Value of these observations and 
ratings, obviously, will depend on the skill and experience of the ex- 
aminer. These and similar observations, as already pointed out, are de- 
sirable, in fact essential, in the complete report on and evaluation of 
any individual's test performance. 

VALIDITY AND RELIABILITY. Criteria of validity were those generally
used: (1) known groups, (2) ratings by nursery-school staff, (3) small
overlapping of distribution of total scores between age groups, (4) correlation
with chronological age (r = .92 ± .004), and (5) correlation with
the Stanford-Binet (r = .79 ± .019 for 159 children in the standardization
group, between 3 and 6 years of age). The correlation coefficient for the
last criterion must be interpreted in the light of the fact that the age
range was three years.

In the guide describing the Merrill-Palmer scale and its standardiza- 
tion, no data on its reliability are provided. Subsequent studies, however, 
furnish information from which we may infer its reliability. When Stuts- 
man retested a group of seventy-seven children (ages 2 to 5 years) after 
an interval of two months, she found a correlation coefficient of .72 
between the two sets of scores. Wellman (29), retesting a group of forty- 
four children (ages 20-62 months) after an interval of one week, found 
a coefficient of .92 between the two sets of scores. 


Other Scales 


Several other scales for infants and preschool children are also 
available. They need not be described, however, since their content has 
been derived and adapted from their predecessors. Two of these, both 
intended for use within the first year of an infant’s life, are the California
First-Year Mental Scale (2) and the Northwestern Intelligence Tests (13).
Kuhlmann's revision of the Binet scale (1922) provided test items from
the age of 3 months to 15 years; his 1939 revision began at 4 months
(20, 21). One, the Nebraska Test of Learning Aptitude (18) for the ages
of 4 to 10 years, was devised primarily for use with deaf children. Two
of the scales prepared in England are the Griffith Mental Development
Scale for Testing Babies from Birth to Two Years (17) and Valentine's
Intelligence Test for Children, for ages from 1½ to 15 years (27).

Although considerable thought and effort have been devoted to the 




preparation of these scales, they all suffer, more or less, from defects and 
deficiencies of standardization. They are, for the most part, deficient in 
regard to validation and evidence of reliability. Population samples are, 
in some instances, inadequate or nonrepresentative, or both. The order 
of item difficulty is questionable in some instances. Some of the test items 
used in earlier scales have been modified, but the reasons are not stated. 
In this group, the scales that suffer least from these defects and deficiencies 
are the Kuhlmann and the Griffith. 

Critics of these and similar scales appreciate the difficulties inherent 
in any process of standardizing an individual scale, particularly at these 
early levels. Yet those who construct such instruments have an obligation 
not to present them for general use until they are supported by a reasonably
sound standardization process and findings.


Evaluation of the Scales 


TECHNICAL PROBLEMS. Speeded scales should not be used in tests for
the lowest age levels. The measurement of rate of performance is inadvisable
and can be misleading, for at least two reasons: (1) speed of
performance has not yet become a motivating factor in very young children;
(2) the shifting attention of children at these age levels can obscure
their true levels of skill and insight.

The grouping of test items according to types of activity, as in the
Gesell schedule, has the advantage of readily indicating functions that
are retarded, and those that are accelerated, in the child being examined.
While this kind of analysis is not so immediately apparent in an age
scale, such as Cattell's, it is nevertheless possible.

Since the usual validity criteria are not available in standardizing
scales for infants and young children, this technical problem can be
solved only through longitudinal studies, following the same individuals
over a considerable span of years and correlating their early performance
with later acceptable criteria of validity. Some efforts have been made
in this direction.

Determination of item and subtest reliability, within a scale, presents
difficulties too. If the odd-even method is used, the results can be affected
by the fluctuating attention of the subject. If the test-retest method is
used, the results can be affected by irregularity of growth tempos, when
the time interval is significant. Significance of the time interval varies
with the age of the subject: the younger the child, the shorter the significant
interval. The desirable procedure would be to make retests
within a week.
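
The odd-even method referred to above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item scores are invented, the function names `pearson_r` and `odd_even_reliability` are hypothetical, and the Spearman-Brown step-up (a standard correction for split-half coefficients, not named in the text) is assumed as the way of estimating full-length reliability from the two half scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(item_scores):
    """Split-half reliability from odd- and even-numbered items.

    item_scores holds one list of 0/1 item scores per child, with items
    in scale order (an assumed layout for this sketch).
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    # Spearman-Brown step-up from half-length to full-length reliability.
    return 2 * r_half / (1 + r_half)


# Invented scores for five children on a six-item scale.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
]
print(round(odd_even_reliability(scores), 2))
```

For the test-retest method, `pearson_r` would instead be applied directly to the total scores from the first and second testing of the same children.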

Although most scales for these early age levels extend upward beyond
the 2- and 3-year levels, the use of the Stanford-Binet from age 2 is often
recommended because of its more adequate standardization. Since the
Cattell scale is an extension downward of the Stanford-Binet, and since
it overlaps with the latter, it is a desirable alternate for the Stanford-
Binet. The Merrill-Palmer has also been found to be quite useful to the
age of 3 or 3½ years.

USES. Psychological tests at these early ages are used for two main
purposes: (1) to determine a child's developmental status with respect to 
the behaviors being evaluated at the time of examination; and (2) to 
predict future developmental and ability level, particularly in the cases 
of infants who are being considered for adoption. Most psychologists 
agree that the first purpose is reasonably well satisfied by the sounder of 
these scales. As for the second purpose, with one possible exception, the 
scales used with infants below the age of 18 months have proved in- 
adequate; for when ratings obtained within this period were correlated 
with ratings subsequently obtained with the S-B and other scales, the 
derived coefficients were so low as to be negligible. A recent, careful 
study of this question, in which the Yale (Gesell) scales were used, further 
supports this commonly held view (30). Therefore, when a child is ex- 
amined within the first 18 months of life, for the purpose of predicting 
future mental development, little weight can be given to the numerical 
rating, except in the cases of infants who deviate markedly in either 
direction from the average. The qualified psychologist, in evaluating 
and reporting test results, will note and interpret the qualitative aspects 
of the child's performance and general behavior. 

Cattell’s scale is superior in regard to predictive value. The first three 
coefficients (.10 to .34) in the table showing validity coefficients of this
scale are characteristic of those generally found for the first year of life. 
But the remaining coefficients of validity are higher than those of other 
preschool tests. In general, the predictive value of scales for preschool 
children increases after the age of 18 months or 2 years. While correla- 
tional studies published on preschool groups aged 18 months or more 
show coefficients varying over a considerable range, many of them fall 
in the .40s, .50s, and .60s, with relatively few higher or lower. In general,
the higher the preschool age at the initial test, the higher will be the 
relationship of initial scores to those on retests. 

In spite of their low predictive value for infants, available scales are
of assistance to an experienced clinical psychologist in appraising a
child's behavioral and mental development when attention is given to
analysis of performance on each of the various parts rather than to
numerical scores alone and when the analysis is used in conjunction
with other clinical data. Developmental and intelligence tests for preschool
children must be used with more than ordinary caution, for the
value of the findings is exceptionally dependent upon the skill of the 
examiner in eliciting the child's best efforts and in being able to ap- 
praise his behavior during the examination. 

There are several reasons why results of tests in infancy and the earlier
preschool periods do not have more value in predicting future mental
status. Scoring must often be quite subjective, depending upon the
examiner’s evaluation of behavior. Resistance to examiner, shyness, failure
to exercise maximum effort, and other emotional conditions are undoubtedly
operative in some instances. A fundamental problem, however, is
the fact that there are changes and irregularities in the tempo of development
of numerous young children. It has been found that successive
examinations of some infants show fluctuations over two or three levels;
or they may show a consistent trend downward or upward before leveling
off to variations within a relatively narrow range of ratings. These fluctuations
and trends in rate of development may be the result of changes
in mental organization; that is, of differences in the age of appearance
of various functions and differences in their rates of development, the
appearance of new functions and changes in rates being especially rapid in
the first two years of life.

Closely allied to these reasons is the fact that tests included in infant 
scales are dissimilar to those used at later age levels, so that little correla- 
tion is to be expected. The reader will have observed that tests used in 
appraising the development of an infant in the first 18 months of life 
are, for the most part, of relatively simple motor activities and of sensory 
perception. These have never been found to correlate significantly with 
tests used at later age levels, which increasingly involve the higher and 
more complex mental functions. It may be that psychologists will not 
be able to devise infancy tests having greater predictive value; for those 
functions which are subsumed under the term “intelligence” do not 
reach measurable magnitude until a later age. That is, intelligence, as
psychologically understood and defined, does not emerge sufficiently dur- 
ing the earliest phases of development. 

When tests are used with children below the age of 18 months, em- 
phasis should be placed upon an analysis of performances and upon 
evaluation of the child’s present status. Thereafter, the scales increase in 
value for the purpose of predicting later mental level. Bayley, after surveying
her own and other researches, concluded that tests given between
2 and 4 years of age will predict 8- and 9-year intelligence test performance
with moderate success (r = .55) and tests given at 4 years of age will
predict 8- and 9-year performance much more satisfactorily (r = .75).
These conclusions, however, were written before the publication of Cattell's
scale with its validity data. It appears, therefore, that, as Cattell
has shown, it is possible to devise scales of significantly higher predictive
value for use with preschool children who are more than 18 months old.

INTELLIGENCE TESTS AS
CLINICAL INSTRUMENTS


Clinical psychology involves, among other procedures, the in- 
tensive psychological study of individuals and uses testing, interview, 
observation, and history-taking as tools, in whole or in part. The purpose 
of such study, in some instances, is to determine the causes of each indi- 
vidual's malfunctioning and to prescribe suitable educational and psycho- 
logical measures to deal with the problem. These measures may include 
educational changes and adaptations, manipulation of the person's en- 
vironment in one or more of several ways, vocational guidance, counsel- 
ing, or psychotherapy. 

Not all persons coming to a psychological clinic, however, present 
problems of maladjustment. Some may be individuals seeking objective 
and psychologically valid information concerning their general and spe- 
cific abilities, interests, and personality traits. 

All types of tests have become indispensable instruments in the broad 
practice of clinical psychology. Every test can be clinical in a literal sense, 
since it helps to analyze an individual's abilities, to obtain a more nearly 
complete description of his strengths and weaknesses. In this chapter we
shall deal mainly with only two of the intelligence tests widely used in 
clinics—the Stanford-Binet and the Wechsler scales—but will consider 
briefly several other instruments devised particularly for the determina- 
tion of mental abnormality or deterioration. In a later chapter we shall 
present some tests of personality and their clinical uses. Although tests
of specific aptitudes and educational achievement are often essential in
the study of some individuals, they do not present the clinical problems
and possibilities found in intelligence and personality tests. The clinical 
uses of aptitude and achievement tests, especially in cases of educational 
and vocational counseling, are much more obvious and direct. Nor shall 
we deal here with tests for infants and preschool children. These are not 
used in so wide a variety of clinical problems as are the Stanford-Binet 
and the Wechsler. Their organization, furthermore, is such that they 
readily lend themselves to analysis of performance in terms of sensory, 
motor, perceptual, language, and social development. 


Factors Affecting Test Performance 


Before specific tests are discussed, it is necessary to point out
the factors that can affect an individual's performance on any psychological
examination. These are of two general kinds: intrinsic and extrinsic.
By the former, we mean factors within the subject himself; by
the latter, we mean those outside the subject. Lack of IQ constancy, in
some instances, must be evaluated in the light of the following factors.

INTRINSIC FACTORS. Factors from within the subject that may affect
test performance include (1) organic difficulties, such as defective hearing
or vision; disability or enervation due to malnutrition or localized or
generalized infections; glandular dysfunctions; acute or chronic illnesses
that lower an individual's level of performance; brain damage; (2) emotional
conditions as evidenced by lack of interest, lack of seriousness, deliberate
deception, negativism; inhibition due to shyness or lack of confidence;
hyperactivity and restlessness; neuroses and the more severe forms
of mental disturbances; (3) language handicaps; and (4) speech defects.

The presence or absence of organic difficulties may be inferred from
the subject's history or from reports provided by observers, such as teachers;
but their existence and degree can be adequately determined only
through physical examination. During the process of testing, however,
the experienced examiner is often led to suspect, or infer, the presence
of organic difficulties. The report of the psychological examination, in
such instances, includes a description of the subject's behavior and performance
that gave rise to the suspicion or inference.

The psychological examiner must also be able to discern the presence
of emotional factors in the subject's behavior or, where the nature of
the instrument permits, in his actual performance on various parts of the
test itself. This latter aspect will be discussed specifically in connection
with the several tests to be presented in this chapter. The discernment
of negativism, deception, shyness, lack of interest, and lack of seriousness
is a form of subtle clinical insight that can be developed only through
the experience of testing a variety of individuals who manifest these
traits in varying degrees, and of others who do not manifest them at all.
Such contrasting individuals provide the necessary basis of comparative
evaluations.

FIG. 14.1 (above, left). Copy of a diamond drawn by a boy with impaired
visual-motor functioning. The same boy drew Fig. 14.2; CA, 9-5; Stanford-Binet
MA, 8-0; IQ, 84. Compare this reproduction with Fig. 14.3.

FIG. 14.2 (above, right). Drawing of the human figure by a boy with impaired
visual-motor functioning. CA, 9-5; Stanford-Binet MA, 8-0; IQ, 84. This
drawing, according to the norms of the Goodenough scale (Measurement of
Intelligence by Drawings), gives this boy a mental-age rating of 6-3, and an
IQ of 66. See also Fig. 14.1.

FIG. 14.3 (left). Copy of a diamond by a mentally deficient boy. CA, 18-2;
Stanford-Binet MA, 8-4; IQ, 63. Compare with Fig. 14.1.

EXTRINSIC FACTORS. The extrinsic category includes the following: (1)
accidental factors, such as errors in time limits, broken pencils, distracting
noises; (2) scale errors inherent in the tests themselves owing to imperfect
standardization; (3) scoring errors as a result of the examiner's
judgment or the marginal character of the response (that is, a response on
the border line between the acceptable and nonacceptable); (4) skill of the
examiner, who must not only be thoroughly familiar with the instrument
being used, but also must be able to establish the rapport necessary to
elicit the subject's best performance. These sources of error imply that
the examiner must be highly qualified and must know the standardization
and limitations of his instrument.

Administering a psychological test provides a situation in which the
subject is psychologically observed as well as being scored in accordance
with objective standards. In addition, analysis of an individual's performance
on subtests and specific items often yields significant information
concerning his mental status and mode of functioning. The following
sections will be devoted to this aspect of test interpretation.


The Stanford-Binet Scale

USES. In schools, guidance centers, and clinics, the problems for which
intelligence tests are being widely used are as follows:

1. Diagnosis of mental deficiency and mental retardation.
2. Determination of mental levels of delinquents.
3. Diagnosis of intellectual disturbance and deterioration.
4. Differential diagnosis (that is, mental traits and psychological profiles of
various clinical groups).
5. Determination of the intelligence of maladjusted children.
6. Determination of the intelligence of children with special disabilities in
learning.
7. Determination of mental superiority.
8. Educational guidance for children and others who are not “problem”
cases.
9. Vocational guidance, for which the determination of intelligence level is
essential.


FUNCTIONAL ANALYSIS. For these and other clinical problems, it
is not sufficient to find only the mental age and the intelligence quotient,
since an individual's performances on the several parts of an intelligence
test are not entirely uniform. He will be above his general average on
some, and below it on others. The MA and IQ are composites, whereas 
in the analysis of an individual case, we must also determine the compo- 
nents that have yielded these indexes. The reader should recall that tests 
are standardized on the basis of performances which are characteristic 
of representative groups of persons at particular age levels; but it seldom 
happens that a given individual conforms to his age group in all respects. 
Two identical mental ages may differ functionally in terms of their com- 
ponents, as inspection of the results on two Stanford-Binet tests will 
reveal. For example, two children, each with a mental age of 10, may 
vary in respect to basal year, extent of scatter (see below), and items 
passed or failed at each of the several levels. 

Functional analysis of Stanford-Binet test results is not readily apparent 
because the items are grouped according to age levels rather than accord-
ing to uniform types of subtests, or factors. (Compare the Stanford-Binet, 
for example, with the Wechsler scales [Chapter 11] and the Chicago 
Tests of Primary Mental Abilities [Chapter 17].) It is possible, however, to 
draw inferences about an individual's strengths and weaknesses through 
analysis of his responses to the various test items. For example, levels 
may be determined in the following functions: visual perception of form 
(three-hole formboard, copying a square and a diamond); visual imagery 
(copying a bead chain from memory, and paper-cutting test); visual mem- 
ory (reproducing geometric designs); thinking (absurd sentences and 
arithmetical problems); memory span and attention (recall of digits and 
of meaningful sentences); ability with abstractions (verbal similarities and 
differences); reasoning (plan of search, picture absurdities, and picture 
interpretation); word knowledge (vocabulary); concept formation (defini- 
tions of abstract words). Analysis of successes and failures in these or
similar terms will assist the examiner in determining the psychological 
components of average or superior mental-age rating, or the deficiencies 
responsible for an inferior rating. 

Analysis is valuable in detecting the causes of specific learning dis- 
abilities, as in reading. For example, the test items may reveal deficient 
perception and recall of visual patterns, defective visual recognition, poor 
copying and reproduction of form, and short memory span—all of which 
are causes of confusion and difficulty in learning to read and in the 
mastery of other school subjects (see Figs. 14.1, 14.2, and 14.3). When
such defects are revealed by the Stanford-Binet scale, further intensive 
examinations, with special tests of vision and of the functions required 
in reading, are indicated. 


As illustrations of evaluations of Stanford-Binet performance, the following
reports are cited.

Range of Allen's performance extends from the 8-year level to his CA
level (10 years). His comprehension and reasoning with abstractions are at 
the level of his chronological age. Relative to his average level, Allen is 
generally superior in dealing with verbal abstractions. Comprehension of 
problems, analysis of absurdities, and problems of logical fact rank relatively 
high in his performance. He is weak in dealing with number concepts; can- 
not make change accurately. Visual perception and visual memory are in- 
ferior to chronological age expectancy. Memory for verbal materials is some- 
what inferior to CA level. Vocabulary is inferior; he fails to pass at 10-year 
level; failed rhymes; word association is poor; reads with difficulty. (IQ about
90.)


Lois' test performance was not consistent. She did well on visual items
requiring form discrimination and identification, but completely failed
esthetic comparison. She was unable to reproduce a square correctly, al- 
though hers was a close approximation; and she completely failed to grasp 
the concept of man-completion. However, she was able to copy folding of 
paper into a triangle and she did pass maze tracing. She consistently failed 
on items requiring memory; she could not remember either digit spans 
or sentences, although her speech difficulty may have been a major influ- 
ence, especially in the latter. She seemed unable to grasp concepts as a 
whole; she often misunderstood [...] the impression that
she occasionally made correct responses [...] result of understanding.
She was self-critical [...]
she did painstakingly and with self-appraisal [...]
also was critical of her paper triangle and took [...]
after saying, "That's wrong," of her first attempt. [...]
aware of her failures in other respects. (IQ about 55.)

Nancy was a very cooperative subject. She was somewhat apprehensive
when she came into the room, but quickly relaxed and entered
actively into the tasks. She had a long attention span and was not distracted
by outside stimuli. She thought carefully before answering difficult questions,
and was self-critical, for she sometimes thought aloud and corrected herself
as she thought. She was rather self-confident and failure on items, some of
which she recognized herself, did not seem to bother her. She laughed
occasionally, but on the whole was serious most of the time, giving all her
attention to the test. (IQ about 125.)

SCATTER ANALYSIS.¹ There are several methods of measuring scatter:
(1) the number of age levels, from and including the basal year, up to
the highest level where any items are passed; this is known as range of
scatter; (2) the number of items passed above the individual's mental-age
level and number failed below; this is known as area of scatter; (3)
a combination of range and area, whereby the number of successes and
failures at each level is weighted according to the distance of that level
from the mental-age level. If only one method is to be used, the first of
these is preferable because it covers the entire range of performance and
expresses the situation in the most direct and simple manner.

¹ Scatter analysis means the distribution of successes and failures over the
several levels of difficulty, or the extent to which test items passed or failed
by an individual spread over different levels of difficulty.
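
The three measures of scatter just described can be made concrete with a short Python sketch. It is an illustration under assumptions, not a scoring procedure from the Stanford-Binet manual: the record of passes and failures is invented, the function names are hypothetical, the basal year is taken to be the highest level at and below which every item given was passed, items at the mental-age level itself are left out of the area count, and the weight used for the third measure is simply the distance in age levels.

```python
def basal_year(record):
    """Highest level at which, and below which, every item given was passed."""
    basal = None
    for level in sorted(record):
        if all(record[level]):
            basal = level
        else:
            break
    if basal is None:
        raise ValueError("lowest level tested contains a failure; no basal year")
    return basal


def range_of_scatter(record):
    """Number of age levels from the basal year through the highest level
    at which any item was passed."""
    basal = basal_year(record)
    highest = max(level for level in record if any(record[level]))
    return sum(1 for level in record if basal <= level <= highest)


def area_of_scatter(record, mental_age_level):
    """Items passed above the mental-age level plus items failed below it."""
    passed_above = sum(sum(items) for level, items in record.items()
                       if level > mental_age_level)
    failed_below = sum(len(items) - sum(items) for level, items in record.items()
                       if level < mental_age_level)
    return passed_above + failed_below


def weighted_scatter(record, mental_age_level):
    """Area of scatter with each item weighted by its level's distance
    from the mental-age level (an assumed weighting)."""
    total = 0
    for level, items in record.items():
        distance = abs(level - mental_age_level)
        if level > mental_age_level:
            total += distance * sum(items)                 # passes above
        elif level < mental_age_level:
            total += distance * (len(items) - sum(items))  # failures below
    return total


# Invented record: age level -> pass (True) or fail (False) on each item.
record = {
    8: [True, True, True, True, True, True],
    9: [True, True, False, True, True, False],
    10: [True, False, True, False, False, True],
    11: [False, True, False, False, False, False],
    12: [False, False, False, False, False, False],
}
print(range_of_scatter(record), area_of_scatter(record, 9), weighted_scatter(record, 9))
```

Two children with the same mental age can yield quite different values on all three measures, which is the sense in which identical mental ages may differ in their components.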

Investigators have reported that psychotic persons show greater scatter
than do normal individuals or the nonpsychotic mentally deficient. Many
reports state also that excessive scatter is a useful diagnostic sign, particularly
in the case of organic psychoses, which apparently yield the greatest
degree of scatter.

In addition to general scatter, analyses have been made of "selective
scatter"; that is, the effects of specific psychotic conditions upon ability
to pass specific kinds of test items. For example, schizophrenics are said
to be less proficient than normal persons in detecting absurdities, interpreting
fables, solving the "purse and field" test, memory for designs, and
passing problem questions. On the other hand, they are said to be relatively
proficient in vocabulary (word meaning).

These views on both general and selective scatter are tentative. There
are several reasons for disagreement among the published findings, one
or more of which might operate in any given investigation. These include
inadequate control groups, inadequate control of mental age and
chronological age of compared groups, divergent techniques of scatter
analysis, and errors of psychiatric diagnosis.²

In view of inconsistent data, we must conclude that numerical measures 
of general scatter on the Stanford-Binet scales are, at present, of limited 
use as clinical aids, so far as most individual cases are concerned. On the 
other hand, extreme degrees of scatter have been found diagnostically 
valuable often enough so that analysis of scatter on the Stanford-Binet 
continues as a clinical technique. 

Regarding selective scatter, there is a greater degree of agreement; for 
the published researches deal mainly with schizophrenics, among whom 
there are a number of different syndromes that affect test performance. 
Other abnormal groups have also been studied. Although investigators 
have not been in complete agreement regarding the several abnormal 
groups, all reports show that schizophrenia involves selective impairment 
of the mental functions tested. It is to be expected that the particular 
functions impaired, and the degree, will vary with the particular type of 
schizophrenic disorder; but in this group as a whole, test performance 
tends to be inferior more often and more markedly in parts requiring 
practical reasoning and judgment (What is the thing to do when . . . ?),
in abstract reasoning (arithmetical problems), and in perceptual organization.
Inferior performance is revealed not only in the lower scores, but
in the quality (that is, bizarreness, irrelevance) of responses, and inconsistencies
(failing easy items and passing more difficult ones).

² For a comprehensive bibliography on scatter, see D. Rapaport (51, pp. 554 ff.).

In the cases of most delinquents and mental defectives—nonpsychotic 
atypical groups—a quite different pattern of performance was found. 
Their vocabulary and Stanford-Binet scores were low, while their performance-test
ratings were higher. Some individuals in these two groups,
however, had relatively low performance-test ratings; but they were found 
to be the least likely to make a satisfactory social adjustment. 

It also has been reported that among maladjusted children, most of
those with relatively high performance-test ability were individuals referred
to clinics as "delinquents"; whereas children with significantly
lower performance than verbal ratings were mostly those having personality
defects (psychoses, neuroses, emotional instability).

Several important statistical facts should be considered in connection 
with the interpretation of scatter on the Stanford-Binet scales. The first 
is that the percentages of the standardization population passing each item 
at any given age level are not all equal. For example, at age 7 (Form
L), the percents passing vary from 51 to 70; at age 8, from 57 to 67; at age
12, only from 61 to 69. These differences signify that the items at a given
age level are not all of equal difficulty; hence uniformity of performance is
not to be expected. It will be recalled that placement of items was made
on the basis of several considerations, only one of which was percent
passing.

Other technical factors that might affect scatter are the reliability of
each item; low intercorrelations between items, some of which might
make demands upon a more or less specialized ability; or items incorrectly
placed in the age scale. These technical factors may account for part
of the scatter found; however, they cannot account for all of it.

CONTENT ANALYSIS. Psychologists have been increasingly interested in
analyzing verbal responses to test items for their possible significance regarding
the testee's experiences and personality traits. That is to say, the
Stanford-Binet, and other individual scales, have some value as projective
tests of personality. (See Chapters 25 and 26.)

The word-naming item (year X) might reveal current interests,
family conditions, school relationships, etc. One 10-year-old boy
(IQ 133) interpolated the following: school, ugh, arithmetic,
dull, stupid. In spite of his high IQ, he was having difficulty
with his studies (particularly arithmetic) and with his teachers. That
school was the source of anxiety was clearly revealed in a [...]
and in the [...] Test. Other
psychologists report words of violence, ambition, [...] and so
forth. Verbal responses may thus reveal more or less deeply rooted emotional
conditions.

Word definitions may reveal experiences and centers of interest. For 
example: when “puddle” is defined as "dirty water that you should not go 
into"; or "lecture" as “a lot of yak-yak from a grown-up.” There are also 
times when definitions will be concentrated in a special area (for example, 
sensory), so as to suggest special interests. Some give definitions attended 
by compulsiveness, uncertainty, or overelaboration. 

Values and behavior patterns also are revealed at times. For example, 
under “Comprehension,” one of the questions is this: "What is the thing 
for you to do if another boy hits you without meaning to do it?" A
common, and expected, answer is: "Nothing. He didn't mean it." But at times
a boy replies: "Hit him back," or "Go home and tell my mother (or
teacher)."

Picture interpretations also may offer an opportunity to gain insight 
into aspects of behavior and personality. The responses to pictures may be 
moralistic, or submissive, or hostile, or anxious, etc. Or they may be of the 
ordinary, expected variety. 

Some children who give ready and assured responses to routine ma- 
terials, such as recall of digits and sentences, may be anxious, hesitant, and 
tentative when dealing with items that require their own judgment and 
evaluation, thus suggesting dependence or submissiveness.

"EXTRA-TESTING" PROCEDURES. While testing, it is possible for experienced
examiners to utilize certain practices not prescribed in the manual,
nor even contemplated by the test's author. As examples, the three
following practices are cited. After the formal testing has been completed,
it is sometimes desirable to check on selected item-failures to determine
whether failure was because of lack of capacity, verbal handicap, or
inability to understand instructions. The probing may be done by giving
actual instruction on some items, then testing the individual further on
the same types of items. The results, of course, do not change the
subject's score on the test; but the procedure and its results should be
noted in the report. For example, if the child has failed on Similarities
and Differences (year VIII, item 4), when testing has been concluded the
examiner might return to that item, explain the terms and the nature of the
problem, and even give the correct answer to one of the four parts. He
then asks the subject to give the answers to the remaining parts to
determine if there has been any learning.

A second form of extra-testing procedure is intended to discover how far 
a testee has been able to proceed on an item he has failed, as in arith- 
metical problems. The examiner returns to the problem; he leads the 
testee through the solution step by step, at times suggesting the procedure,
or even the basic process to be used. Thus, it is often possible to discern
the nature and extent of a subject's deficiencies. 

If a child has failed to copy the diamond satisfactorily (year VII, item 
3), the examiner may want further information on whether the inability 
might be due to disturbed visual-motor functioning or to defective per- 
ceptual analysis. He may, then, make three copies of the diamond, of 
varying quality, one of which is a duplicate of the testee's own copy. The 
testee is then asked to select the "best" one. It has been found that the 
child whose poor copy is consistent with his general mental level will not 
select the examiner's superior reproduction, or will select at random; but 
a subject whose general mental level is higher than that suggested by his
own copy can make the correct selection, even though he still will not 
be able to produce a satisfactory copy on subsequent trial. 

These are three illustrations of extra-testing procedures that enable the
examiner to prepare a richer and more analytical report.

ORDER OF ITEM ADMINISTRATION. The order in which items of the
Stanford-Binet should be presented has been under examination and
discussion in recent years because clinicians have found that order of
presentation in some instances influences performance of maladjusted
subjects. Three methods have been used: (1) standard, or consecutive; (2)
serial; (3) adaptive. Using the first of these, the examiner presents the
items in the order in which they appear in the scale. Using the second, he
follows through on one type of item, to the more difficult ones, until the
subject fails: for example, Memory Span for Digits, Memory for Sentences,
Similarities and Differences, Comprehension, and so on. If the
third method is used, the examiner starts with easy items and alternates
these with difficult ones of the same type. When employing the adaptive
method, the examiner begins at a level below the subject's estimated
mental age (to insure success) and moves up and down in an effort to
establish the maximal (or terminal) and the basal levels as quickly as
possible. The principal justification given for the adaptive method is that
some individuals are discouraged by a series of consecutive failures when
they reach higher and more difficult levels; they then fail
items that they could otherwise pass. The adaptive method is intended to
scatter failures among successes.
The sparse experimental data available support use of the adaptive
procedure. It has been found that there is little difference between
results obtained by the standard and the adaptive methods when a
well-adjusted group is tested. Thus, the norms established by Terman and
Merrill for standard testing are applicable when the adaptive
method is used. One study reports a correlation of .93 between results
obtained when the two methods were compared, using Forms L and M. The




order of testing. One study reports a mean gain of eleven points when the 
adaptive method was used, as compared with the standard consecutive 


Clinical Diagnosis. For the purpose of analysis of functioning
and performance, the Wechsler scales have an apparent advantage over 
the Stanford-Binet, since the items are grouped into subtests, each ex- 
amining mental operations that might be susceptible to different forms of 


the preparation of analytical profiles for individuals; it also permits the 
ready analysis of subscore interrelations, which can be studied for any 


groups of persons. These researches deal with differences in extent of
scatter among the several groups and differences among subtest scores
within groups, emphasis being placed upon the latter.

The published studies are devoted principally to the determination of


cause they indicate efforts being made to determine the analytical and
clinical values of the Wechsler scales.


Among these studies are those especially concerned with the extent of
deterioration, or deficit, in intelligence suffered by members of each of
several groups: alcoholics, schizophrenics, manic depressives, paranoiacs,
psychoneurotics, and others. One method commonly used has been to
compare the averages of mental ages and the measures of variation of a 
sample of abnormal persons with corresponding indexes found for a 
sample of normal persons. The theory is that inferior status of an ab- 
normal group, as compared with one that is normal, is a result of the 
mental disorder. On the whole, results indicate that groups suffering 
from organic psychoses (for example, alcoholism, paresis) show the greatest 
loss; those suffering from functional psychoses (schizophrenia, paranoia) 
show less; while those in the psychoneurotic group give no evidence of 
deficit. The published studies do not agree as to the average amount of 
deterioration in each group. Furthermore, within each category of mental 
illness, there are marked individual differences in the degree of
deterioration (32, 51).

Because of these variations and the overlapping of group distributions 
of ratings, the validity and clinical use of group deficits have been seri- 
ously questioned in making individual diagnoses. Investigations of group 
deficits have been valuable insofar as they have demonstrated that some 
degree of deterioration of mental functioning accompanies mental
disorder. Since clinical practice, however, is concerned with each person as an
individual rather than as a member of a specified group, it is necessary to
determine individual rather than group characteristics or group losses of
mental ability.

On the other hand, for the use of scatter analysis in making diagnoses,
it may be stated on the positive side that the test "signs" of a particular
syndrome will often correctly identify a small, but nevertheless significant,
percentage of cases.4 Furthermore, the examining psychologist has
available information about the individual (the case history) in addition to
results from the objective test. The psychologist's diagnosis
will be based upon these several sources of information and upon their
interrelationships.

4 "[...] when one looks for unique patterns which are both
highly [...] individuals tested. This fact has [...] termed the method of
successive sieves. The [...] that one can arrive at diagnostic combinations
[...] easily by [...] pattern. But the possible [...]
eventually detected can be increased [...] second,
third and fourth sieves [criteria] to the same population" (64, pp. 167-168).

Scatter Analysis. Brown, Rapaport, et al. employed an
intraindividual measure of scatter (51). This consisted of: (1) calculating,
for each individual, the deviation of each subtest score from the mean of
all his subtest scores; (2) computing the deviation, for each individual, of
each verbal-subtest score from his verbal-subtest mean; (3) computing the
deviation of each performance-subtest score from the performance-subtest
mean.
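
The three deviations just listed can be sketched in a few lines of Python. This is only an illustrative sketch, not the investigators' own procedure: the function name, the grouping of subtests into verbal and performance lists (taken from the Wechsler profile shown in this chapter), and the weighted scores of the example subject are all assumptions introduced here.

```python
from statistics import mean

# Subtest groupings assumed from the Wechsler profile in this chapter.
VERBAL = ["Information", "Comprehension", "Digit Span",
          "Arithmetic", "Similarities", "Vocabulary"]
PERFORMANCE = ["Picture Arrangement", "Picture Completion",
               "Block Design", "Object Assembly", "Digit Symbol"]


def intraindividual_deviations(scores):
    """Return, for one person, the three sets of deviations described above."""
    overall_mean = mean(scores.values())
    verbal_mean = mean(scores[name] for name in VERBAL)
    performance_mean = mean(scores[name] for name in PERFORMANCE)
    return {
        # (1) each subtest score minus the mean of all subtest scores
        "from_overall_mean": {n: s - overall_mean for n, s in scores.items()},
        # (2) each verbal score minus the verbal-subtest mean
        "verbal_from_verbal_mean": {n: scores[n] - verbal_mean for n in VERBAL},
        # (3) each performance score minus the performance-subtest mean
        "performance_from_performance_mean": {
            n: scores[n] - performance_mean for n in PERFORMANCE
        },
    }


# Hypothetical weighted scores, invented purely for illustration.
example_scores = {
    "Information": 12, "Comprehension": 13, "Digit Span": 8,
    "Arithmetic": 10, "Similarities": 12, "Vocabulary": 14,
    "Picture Arrangement": 8, "Picture Completion": 7,
    "Block Design": 6, "Object Assembly": 7, "Digit Symbol": 7,
}

deviations = intraindividual_deviations(example_scores)
for name in VERBAL:
    print(name, round(deviations["verbal_from_verbal_mean"][name], 2))
```

Because every deviation is taken from the person's own mean, a positive value marks a relative strength for that individual and a negative value a relative weakness, regardless of how the person stands in relation to any group.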

Scatter of subtest scores was further investigated, resulting in a table of
test characteristics, or patterns, of several clinical categories: organic brain
disease, schizophrenia, anxiety states, adolescent sociopaths (delinquents),
and mental defectives (64, pp. 171-172). For example, the pattern of
organic brain disease shows relatively high scores on Information,
Comprehension, and Vocabulary (very high); but moderately low scores on
Arithmetic and Similarities; while the scores on Digit Span, Digit Symbol,
Block Design, and Object Assembly may be very low, relatively. The
pattern of adolescent delinquents is characterized by relatively high subtest
scores in Picture Arrangement and Object Assembly; moderately high
scores in Picture Completion and Block Design; relatively low scores in
Information, and moderately low or average scores in the remaining
subtests.5

The explanation of these and other patterns is, presumably, that mental
defect, each major type of personality disorder, and mental illnesses are
associated with characteristic deficits or losses of psychological
functioning. Mental defectives are individuals whose mental development has been
arrested at a relatively low level, though not uniformly in respect to all
functions. A personality disorder or a mental illness produces impairment
in the individual's mental abilities, but the resultant losses are not
uniform; apparently some functions are more adversely affected than others.
Thus, if a psychological test is able to measure and portray these
differential deficits and losses, it can assist significantly in making a clinical
diagnosis.

In Rapaport's investigation, discussed at the beginning of this section,
essentially two forms of scatter analysis were employed.

1. Vocabulary scatter. This is the difference between a person's score on
a particular subtest and his score on the Vocabulary subtest. The reason
for using Vocabulary as the base of comparison is that it has rather
consistently been found to be the psychological test least vulnerable to
impairment by personality disorders or mental disturbance. It is the
Vocabulary score, therefore, from which the individual's original,
unimpaired intelligence level is inferred. Degree of loss in other functions can
thus be estimated from the differences between ratings on each of the
subtests and the rating on Vocabulary.


5 In each instance, each subtest score is high, low, or average with reference to an
individual's own average score; not in relation to average performance of a group.

2. Mean scatter. This is the difference between the rating on any single
subtest and the average of the ratings on all the remaining subtests. The 
mean scatter, which may be positive or negative for a given subtest, meas- 
ures the relationship of a single measured function to the average of all 
other functions measured by the test. It is thereby possible to find out 
whether an individual's level of achievement and functioning on any of 
the subtests has deteriorated more or less than on the remaining ones. 

Mean scatter can be calculated for the entire test, including both verbal 
and nonverbal subtests, or it can be found separately for the six verbal 
and the five nonverbal scores.
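
As a rough illustration of the two measures defined above, the following Python sketch computes vocabulary scatter and mean scatter for one set of verbal ratings. The function names, the optional `subtests` argument, and the example ratings are assumptions made for this illustration; they are not taken from Rapaport's report.

```python
def vocabulary_scatter(scores, base="Vocabulary"):
    """Each subtest rating minus the rating on the base (Vocabulary) subtest."""
    return {name: value - scores[base]
            for name, value in scores.items() if name != base}


def mean_scatter(scores, subtests=None):
    """Each subtest rating minus the average of all the REMAINING subtests.

    `subtests` optionally restricts the computation, for example to the
    verbal or the nonverbal subtests only. At least two subtests are needed.
    """
    names = list(subtests) if subtests is not None else list(scores)
    if len(names) < 2:
        raise ValueError("mean scatter needs at least two subtests")
    total = sum(scores[name] for name in names)
    result = {}
    for name in names:
        mean_of_others = (total - scores[name]) / (len(names) - 1)
        result[name] = scores[name] - mean_of_others
    return result


# Hypothetical verbal ratings, invented purely for illustration.
verbal_ratings = {
    "Information": 12, "Comprehension": 13, "Digit Span": 8,
    "Arithmetic": 10, "Similarities": 12, "Vocabulary": 14,
}

print(vocabulary_scatter(verbal_ratings))
print({name: round(value, 2)
       for name, value in mean_scatter(verbal_ratings).items()})
```

In this invented case Digit Span falls 6 points below Vocabulary and 4.2 points below the average of the other five verbal subtests, so both measures point to the same relative weakness.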

A third procedure—one that supplements scatter analyses and is re- 
garded as essential—is the comparison of pairs of subtest scores. Any two 
subtest scores may be compared for the purpose of finding out how the
subject's functioning in one compares with his functioning in another.
For example, how does his retention of Information (remote learning)
compare with his Memory Span for Digits (immediate recall and
attention)? How does his Arithmetic score (reasoning and habitual responses)
compare with his score on Similarities (concept formation)? Thus, it is
possible to obtain, in considerable detail, an analysis of the subject's
achievement levels on the several parts of the scale, their
interrelationships, and evaluations of general and special mental functioning.

The profile in Figure 14.4 illustrates the methods we have been
discussing. We quote, also, from the interpretation of this profile, not only to
clarify the profile, but also to emphasize the fact that [...]
analysis is inadequate and to illustrate how the psychological rationale
[...] is utilized in interpretation.

It is evident that not all clinical cases present patterns
of test performance upon which a differential diagnosis can be
made. In an appreciable number of cases scatter and subtest analyses have
been found definitely diagnostic; in others the analyses, though not
conclusive, provide indications of the probable diagnosis; in still others
the scatter profile and analyses are inconclusive. Although test analysis
does not provide a definitive diagnosis in [...], the practice is
warranted because it is as efficient as [...]. Such
analyses are interpreted in conjunction with results obtained with other
psychological tests (for example, projective methods) and other clinical
and developmental evidence.


In drawing inferences from a scatter analysis, it is necessary to consider
the possible influence of environmental predilections or pressures, which
may enhance or detract from certain forms of test achievement. In such
an event, the diagnostically distinguishing features of intelligence tests
will be affected and probably invalidated. Thus, before drawing
diagnostic inferences from intelligence-test results, the psychological examiner
must know whether, in the case under consideration, there have been
educational, cultural, or other environmental factors that might account
for any of the ratings and seemingly diagnostic criteria.

6 In calculating mean scatter, Rapaport omitted Digit Span and Arithmetic. This
omission, he [...], was warranted by the fact that "[...] were so general in
most of the clinical and [...] would have vitiated the representativeness
of the mean [...] scores" (51, p. 537).

[Figure 14.4: scatter profile of a neurotic depressive. Subtests shown:
Comprehension, Information, Digit Span, Arithmetic, Similarities, Vocabulary,
Picture Arrangement, Picture Completion, Block Design, Object Assembly,
Digit Symbol. Score scale: 6 to 17.]

FIG. 14.4. "Outstanding in the scatter is the great discrepancy between
the verbal and the performance subtest scores, and this discrepancy we
have found to be a statistically significant, and therefore diagnostic,
indication of depression. The rationale of this finding is the following:
depression becomes manifest in intellectual functioning by a retardation
of perceptual and associative processes. The relatively complex visual
organization and visual-motor coordination required by the performance
subtests put too great demands upon the slow-downed depressive.
Furthermore, in contrast to the untimed verbal subtests, the performance
subtests have time limits on each item and even give extra credit for
speed. Consequently, depressives not only do not obtain extra credit, but
exceed the time limit on many items. For this case, item analysis confirms
the retardation by showing that, on Picture Completion and Block
Design, a number of items were failed only because they exceeded the time
limit . . . the impairment of Digit Span or attention is also striking
and reflects the presence of intense anxiety accompanying the depression.
The mild impairment of Arithmetic is referable to an inability to meet
the time limits and gain time bonuses on the items of this subtest. In
verbalization, much self-depreciation, as well as indirect criticism of the
test and the examiner, are evident." R. Schafer (56). By permission.

Another aspect to consider in evaluating the significance of subtest 
differences is their reliabilities. It will be recalled that the reliability co- 
efficients of the eleven subtests are not consistently high; thus, only large 
differences in weighted (scaled) scores may be accepted with confidence. 
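
One conventional way of making "large" concrete, which the text itself does not spell out, is the standard error of the difference between two scores expressed on the same scale, SD x sqrt(2 - r_a - r_b), where r_a and r_b are the reliability coefficients of the two subtests. The sketch below is offered only under stated assumptions: a weighted-score standard deviation of 3, two hypothetical reliability coefficients (.70 and .85), and an arbitrary criterion of 1.96 standard errors. The function names are illustrative.

```python
import math


def standard_error_of_difference(sd, reliability_a, reliability_b):
    """Standard error of the difference between two scores on the same scale."""
    return sd * math.sqrt(2.0 - reliability_a - reliability_b)


def is_dependable_difference(score_a, score_b, sd,
                             reliability_a, reliability_b, z=1.96):
    """True when the observed difference exceeds z standard errors."""
    se = standard_error_of_difference(sd, reliability_a, reliability_b)
    return abs(score_a - score_b) > z * se


# Assumed values: weighted-score SD of 3; reliabilities are hypothetical.
se = standard_error_of_difference(3.0, 0.70, 0.85)
print(round(se, 2))                                      # about 2.01
print(is_dependable_difference(12, 9, 3.0, 0.70, 0.85))  # 3-point difference
print(is_dependable_difference(13, 8, 3.0, 0.70, 0.85))  # 5-point difference
```

With these assumed figures a 3-point difference falls short of the criterion while a 5-point difference exceeds it; the less reliable the two subtests, the larger the difference required.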

Qualitative Aspects of Responses. Scatter analysis and subtest 
comparisons should be accompanied by a qualitative analysis of the sub- 
ject's verbalizations and general approach to the test problems. Qualita- 
tive analysis reveals characteristics of the individual's mentality in opera- 
tion, such as excessive doubt, indecision, self-criticism, impulsiveness, 
bizarre notions, obsessions, compulsiveness, and random guessing. Certain 
qualitative characteristics are frequently associated with clinical groups. 
Thus, a person’s mode of approach to test problems is significant in the 
interpretation of his behavior and understanding of his personality. 
Classification and labeling are less significant, in an individual case, than 
a description and analysis of behavior and functioning, as shown during 
the test period and as reflected in the test results.

A person's mode of approach will affect his performance differently on
each of the various types of subtests, depending on whether the subtest
makes demands on habituated responses, for example, or on those
requiring flexibility and reorganization. Individuals with brain damage,
as a case in point, show rigidity, inability to shift attention, inability to
change their mode of responding, inability to ignore superficial or
extraneous stimuli, and difficulty in organizing material into a pattern or a
meaningful sequence. Inspection of the subtests will show that
performance on some of them may be seriously affected by behavioral traits
such as these.

Failure to respond correctly to a test item is not the only matter of
importance. While correct and acceptable responses to test items are fixed by
research and by agreement of psychologists regarding the interpretation
of findings, incorrect responses are not fixed; nor is the manner of
responding. To illustrate: current scales have sets of items testing
"Comprehension," "Similarities," and "Arithmetical Reasoning." The first are of
the "when" or the "why" type; that is, "What is the thing to do
when . . . ?" or "Why is it desirable (or necessary) to . . . ?" The
second, "Similarities," requires understanding of basic likenesses; for
example, "In what way are ---- and ---- alike?" The arithmetical
problems range from the simplest to fairly complex. A person's response 
to any of these may be right or wrong; but his speed and confidence in 
answering, his anxiety or blandness about incorrect responses often re- 
veal significant personality traits. When, in a test of Information, an 
otherwise intelligent person blandly replies that Tokyo is in Turkey, or in 
a test of Similarities, that a dog and a lion are alike because both have 
digestive organs, and gives other bizarre responses, personality disturbance 
is indicated. 

Personality difficulties of the obsessive kind, for example, must be con- 
sidered when a subject feels compelled to offer four or five explanations 
of courses of action in reply to a “Comprehension” item; or when he 
mentions three, four, or more likenesses on some of the “Similarities” 
items; or when he gives elaborate and often quibbling definitions of words 
in the Vocabulary test. Or if the person persists in guessing blindly on 
test items that are clearly beyond his level of ability, his test behavior may 
be indicative of an uncritical impulsiveness. 

The previous examples illustrate the role of language with respect to 
certain types of items. Some personality patterns or categories may, how- 
ever, be inferred from the manner in which the person deals with the 
test in general and as a whole. A few illustrations follow. 

In the obsessive-compulsive individual, verbalization of responses is 
over-detailed and doubt-laden. For example, in response to the question: 
“What does ‘stanza’ mean?" the ordinary person who has the information 
would probably say, “A group of rhymed lines," or something similar. A 
characteristic obsessive-compulsive response, however, would be
comparable to this: "A stanza of rhymed lines forming one of a series of
similar divisions in a poem. Two rhymed lines form a couplet. A four-lined
stanza is a quatrain, a six-lined one is a sestet," etc. One or a few such
replies do not warrant the characterization of the respondent as
"obsessive-compulsive"; but when this kind of answer is "idiomatic" of the
individual, such characterization is indicated.

Persons in an anxiety state also frequently give responses that are
typical of that group. In general, their behavior is characterized by
restlessness, apprehensiveness, impaired attention and concentration, and
[...] (such as tics, nail-biting, fidgeting, coughing). This
psychological state is manifested, in the test situation where language is
required, through difficulty with finding words, impulsively blurting out
unfinished, unchecked, or inappropriate replies, or fumbling about for
adequate formulations. For example, when the question is: "How many
weeks are there in a year?" the anxiety-ridden person may reply: "There
are [...] weeks in a year; no . . . let's see . . . or is it [...] . . . wait a
minute . . . let's see . . . 12 months . . . 4 weeks in a month . . . yes,
that's right, 48." Or consider the question: "Why should we keep away
from bad company?" The person in an anxiety state may give a reply of
the following sort. "If I was in bad company, I'd get away quick. I
wouldn't want to be with them in the first place. They . . . er . . . are
unlawful . . . and . . . ah . . . get you in trouble. . . . I don't think
a person should be in bad company; that is, . . . er, ah . . . if he was
brought up right. Anyhow, I'd leave them!"

The test responses of psychotic persons, also, are quite significant for 
diagnostic purposes; but the subject of psychotic responses is a complex 
one. One aspect may be indicated here—disorganization of thinking and 
bizarreness of responses typical of schizophrenics—both, of course, being 
indicated in the language of their answers. For example, the former 
teacher of history who cannot correctly give Washington's birthday, or the 
clergyman who cannot define the word "vesper" are instances that in- 
dicate disorganization of memory and loss of previous knowledge. Simi- 
larly, in replying to the “bad company” question, when one gives an emo- 
tionally intense and moralistic response, and explains, irrelevantly, why 
people should be “good,” such a response suggests serious impairment of 
judgment. Or, in the Vocabulary test, a subject may give impulsive
replies, such as defining "belfry" as "a kind of bellboy," or "repose" as
"to pose over something." These bizarre answers indicate an impulsive
"clang association."

DETERIORATION QUOTIENT. A procedure for estimating the approximate
amount of mental loss is provided in the WAIS manual. This index
is based upon observations that some of the functions tested decline more
rapidly than do others. The difference between the rates of impairment
of the two types of functions is regarded as the indicator of an
individual's degree of deterioration. The tested functions that decline most
markedly with advancing age are placed in the "Don't Hold" category.
Those that decline least are in the "Hold" category. The subtests in the
latter group are Vocabulary, Information, Object Assembly, and Picture
Completion. In the "Don't Hold" category are Digit Span, Similarities,
Digit Symbol, and Block Design.

Since some loss of tested abilities is generally expected with advancing
age, this factor is taken into account in calculating the deterioration
quotient, as shown in the following expression:

$$\text{Deterioration Quotient} = \frac{\text{Hold} - \text{Don't Hold}}{\text{Hold}} \times 100$$

For example, assume that a man 35 years of age has the following scores:
Hold = 50; Don't Hold = 35. Substituting these values in the formula
yields a quotient of 30. The quotient is then compared with the norm for
the individual's age group to determine whether his loss is significantly
greater or smaller than that expected for his age.
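
The arithmetic of the example can be checked with a short Python sketch. The function names are illustrative, and the individual subtest scores are hypothetical values chosen only so that the two sums equal the Hold = 50 and Don't Hold = 35 of the example; the age norm against which the quotient would then be compared is not reproduced here.

```python
# Subtest groupings as named in the text.
HOLD = ["Vocabulary", "Information", "Object Assembly", "Picture Completion"]
DONT_HOLD = ["Digit Span", "Similarities", "Digit Symbol", "Block Design"]


def deterioration_quotient(hold_sum, dont_hold_sum):
    """(Hold - Don't Hold) / Hold, expressed as a percentage."""
    if hold_sum <= 0:
        raise ValueError("the Hold sum must be positive")
    return 100 * (hold_sum - dont_hold_sum) / hold_sum


def quotient_from_scores(scores):
    """Sum the two groups of weighted scores, then apply the formula."""
    hold_sum = sum(scores[name] for name in HOLD)
    dont_hold_sum = sum(scores[name] for name in DONT_HOLD)
    return deterioration_quotient(hold_sum, dont_hold_sum)


# The sums used in the text's example.
print(deterioration_quotient(50, 35))  # 30.0

# Hypothetical subtest scores whose sums reproduce the same example.
example_scores = {
    "Vocabulary": 13, "Information": 13,
    "Object Assembly": 12, "Picture Completion": 12,
    "Digit Span": 8, "Similarities": 10,
    "Digit Symbol": 8, "Block Design": 9,
}
print(quotient_from_scores(example_scores))  # 30.0
```

A quotient of zero would mean that the two groups of subtests stand at the same level; a negative quotient would mean that the "Don't Hold" sum exceeds the "Hold" sum.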

Published researches on this index do not agree on its general validity.
Some support the technique; others do not; still others [...].
The criticism is that the concept of functions which "Hold" and "Don't
Hold" might apply to some age groups and clinical categories but not to
others, and that with many persons selective rather than over-all
impairment is the rule.

The soundness of this technique depends, also, upon a high reliability
of each subtest and upon the adequacy of standardization at each age
level. It is reasonable to conclude that the deterioration index is
significant where the extent of decline is considerable but [...].
In conception, this index, or a similar one, provides an important
technique in estimating changes in mental abilities, since it is seldom
possible, when the subject is an older person, to obtain mental-test scores
that were found at his maximum level of development.


A Report Outline 


The following is an example of the type of outline used in training
graduate students to prepare reports of intelligence tests administered
by them under supervision. Such outlines help to clarify and emphasize
the significance given to qualitative aspects of test performance.


Psychological Examination

Name: (last, first, middle)            Date:
Age: (in years and months)             Date of birth:
Referred by:

I. Introductory Statement
     a. by whom tested
     b. reason for testing
II. Name (in full) of the test used
III. State the individual's general attitude and response to the test
     situation and to the examiner.
IV. Test findings:

     Stanford-Binet                    WAIS or WISC
     a. Mental Age                     a. Verbal IQ
     b. Intelligence Quotient          b. Performance IQ
     c. Basal Year                     c. Full-Scale IQ
     d. Terminal Year                  d. Classification
     e. Classification (average,
        superior, etc.)

V. Test evaluation
     a. Analyze the performance on:
          1. verbal materials
          2. nonverbal materials
     b. Compare the MA and IQ with the Vocabulary test score.
     c. Point out the quantitative and qualitative aspects of the results:
          1. strengths
          2. weaknesses
          3. scatter analysis
          4. quality of language (word definitions, use of language,
             grammar, richness of vocabulary, nuances, etc.)
VI. A summary statement of test performance
VII. Subject's behavior while being tested
     a. Reaction time:
          1. Were responses delayed, blocked, irregular?
          2. Was there any indication of negativism?
          3. Were responses given quickly or impulsively?
     b. Nature of responses:
          1. Are some nonsensical, immature, childlike?
          2. Are they, on the whole, of good quality; or are they
             inconsistent?
          3. Is there confabulation?
          4. Does the subject ask for help?
          5. Is the subject critical of his responses?
     c. Depth of responses:
          1. Are they "surface" responses?
          2. Do they show depth of understanding?
          3. Does the subject try to appear penetrating?
          4. Does the subject adopt a "playful" (defensive) attitude?
     d. Self-references:
          1. Is the question or answer referred to the self? (Describe
             and analyze the references.)
          2. Are the responses in terms of the subject's own or
             immediate experiences, or in terms of someone else's?
          3. Does the individual give expression to his feelings
             during the testing (orally or by body movements)?
     e. Evidence of confusion or doubt:
          1. Do test questions have to be repeated?
          2. Does the subject change his answers? (Under what
             conditions?)
          3. Are questions misunderstood or misinterpreted? If so,
             explain in what way.
     f. Verbalization:
          1. Is the subject verbose?
          2. Is he spontaneous in responding?
          3. Does he have peculiarities of speech?
     g. Organizational methods:
          1. Is the individual careful or overmeticulous?
          2. Does he plan and work systematically? Or is his a
             random approach?
          3. Does he make many false starts?
          4. Does he generalize readily?
          5. Is there evidence of perseveration?
     h. Adaptability:
          1. Does the subject shift readily from one test to the next?
          2. Is his interest sustained in all types of test items?
     i. Coordination:
          1. Are the gross and finer movements skillful or awkward?
          2. Can he smoothly execute bilateral movements?
     j. Effort:
          1. Is the subject cooperative?
          2. Does he give evidence of trying hard?
          3. Does he attend with ease or difficulty?
     k. [...]:
          1. Was the individual readily upset, irritable,
             argumentative, stuporous, happy, sad, depressed?
          2. Were there any emotional outbursts?
          3. Did his mood undergo change during the testing?
          4. Was he, on the whole, cheerful?
          5. In what mood was the individual when he left the testing
             room?7


Tests of Mental Impairment 


In this section, we present brief descriptions and evaluations
of several highly specialized testing procedures developed primarily for
the clinical study of patients. Where used, they are employed in
conjunction with other instruments, particularly the Stanford-Binet, the
Wechsler scales, and projective tests.


The Bender Visual-Motor Gestalt Test consists of nine figures
characterized chiefly by their patterning (that is, their Gestalt). The subject is
simply instructed to copy each figure, without time limit. The test, clearly,
is not one of visual memory and imagery; it is one of perception and
visual-motor functioning.

The figures used were devised by Max Wertheimer, one of the founders
of the Gestalt school of psychology, in his experimental work on
perception. The underlying principle utilized in the Bender test, as
expounded by Wertheimer and others, is that organized wholes (structured
units) are the primary forms of perception in man. Loss of integrative
perception, therefore, might be a psychopathological manifestation.
Perceptual behavior is regarded, in the test, as involving sensory reception
of the figures, interpretation at the central levels of the nervous system,
and motor performance (drawing). This total process of perception and
reproduction can be distorted by neural injury, by emotional
maladjustment in the perceiving individual, and by variations in the intellectual
level of performance. Hence, the test's possibilities were explored by
investigating "gestalt functions" in cases of aphasia, organic brain
disease, schizophrenia, manic-depressive psychoses, mental defectives,
psychoneurotics, malingerers, and normal children.

7 I am indebted to my former graduate student and assistant, Dr. Joanna
Byers, who collaborated importantly in the development of this form.

Pascal and Suttell (46) provide a standardized and quantitative system
of scoring the reproductions of adults. Essentially, the scoring procedure
was arrived at in this manner. (1) The reproductions of psychiatric
patients were compared to those of normal persons. (2) The drawings of
abnormal persons tended to deviate from the originals more than did
those of normal subjects. (3) Deviations (differences between originals
and reproductions) that discriminated between normal and psychiatric
subjects were isolated. (4) A deviation was retained if item analysis
showed that it discriminated significantly between the two groups or if
it occurred in the reproductions of abnormals but "practically never" in
those of normals. (5) Deviations were weighted according to discriminative
value between the two groups. (6) "Score reliability" was determined
(r = [...]0; N = 120). (7) Norms were found for a nonpatient population
of 474 subjects (271 males; 203 females) varying in age from 15 to 50
years, most of whom were attending evening classes at the high school or
college level.

Fig. 14.5. The Bender Visual-Motor Gestalt Test. A visual-motor Gestalt
test and its clinical use. Research Monograph, American Orthopsychiatric
Association, 1938, no. 3.

After the scoring method had been evolved and the norms determined,
validity was studied by (1) "blind" matching of scores with group
classification (normal, neurotic, psychotic) and (2) prediction of
improvability of patients receiving therapy, who were tested on [...].
The test was fairly effective in distinguishing between normals and
psychotics and between normals and neurotics; but it discriminated less
well between psychotics and neurotics. As between patients who [...]
discharge, as "improved" or "unimproved," under (2), [...] the two groups
showed significant differences. The results under (1) and (2) were
sufficiently discriminating to encourage the use of the Bender-Gestalt as
one possible source of significant evidence in the diagnosis of normal
and abnormal personalities.
As in all such situations, it must be noted that although the mean
scores and standard deviations for the groups are significantly different,
there is still appreciable overlapping of scores of the three groups
(normals, neurotics, psychotics). Overlapping of scores within different
groups, classified for various purposes, is the usual psychological
phenomenon. In dealing with an individual, therefore, in respect to a
particular trait or function, it is always necessary to consider the
possibility that he might deviate from the central tendency of his group.
The scores in Table 14.1 illustrate this point.


TABLE 14.1

BENDER-GESTALT TEST MEANS AND RANGES OF SCORES

              Mean scores*   Middle 60%   Range

Normals            50          47-60      32-79
Neurotics          68          53-80      32-139
Psychotics         81          65-100     40-155

* Values are in terms of Z scores; mean equals 50, standard deviation
equals 10. The higher scores are the more unfavorable. Based upon
data in Pascal and Suttell (46, pp. 30-31).
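
The overlapping described above can be read directly from the ranges in Table 14.1. The following is a minimal sketch, not part of Pascal and Suttell's procedure: the function names are illustrative, and the raw-score mean and standard deviation passed to `standard_score` are hypothetical placeholders, since the normative raw values are not given here.

```python
# Ranges transcribed from Table 14.1 (standard scores; higher = more unfavorable).
MIDDLE_60 = {"normals": (47, 60), "neurotics": (53, 80), "psychotics": (65, 100)}
FULL_RANGE = {"normals": (32, 79), "neurotics": (32, 139), "psychotics": (40, 155)}


def overlap(a, b):
    """Return the interval shared by two (low, high) ranges, or None if disjoint."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


def standard_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a scale with mean 50 and standard deviation 10.

    norm_mean and norm_sd are the raw-score mean and SD of the norm group;
    any values supplied by the caller here are assumptions, not published norms.
    """
    return 50 + 10 * (raw - norm_mean) / norm_sd


# Middle 60 percent of normals and neurotics share the band 53-60.
normals_vs_neurotics = overlap(MIDDLE_60["normals"], MIDDLE_60["neurotics"])
# Middle 60 percent of normals and psychotics do not overlap at all ...
normals_vs_psychotics_mid = overlap(MIDDLE_60["normals"], MIDDLE_60["psychotics"])
# ... yet their full ranges still share 40-79.
normals_vs_psychotics_full = overlap(FULL_RANGE["normals"], FULL_RANGE["psychotics"])
```

The contrast between the last two comparisons is the point made in the text: groups may be well separated in their central portions while individual scores still fall in a region common to both.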


More recent researches have provided promising evidence showing that
this test may be used diagnostically with emotionally disturbed [...]
(11), that it differentiates organics from other categories (61), and that
it has been used with some success to differentiate between delinquents
and nondelinquents (67). On the other hand, some findings are inconsistent
and equivocal in studies on differences between psychotic and
nonpsychotic groups (60).

The Babcock test is based upon the widely employed principle that
intellectual impairment or deterioration of an individual can be estimated
by using his vocabulary score to represent his mental capacity as it was
prior to the onset of mental disturbance (s, 88). The individual's scores on
the several other subtests are compared with the scores made on the same
subtests by a normal group of the same “vocabulary age." When this 
technique is used, the implication is that one's vocabulary withstands best 
the adverse influences of advanced age and mental illness, and that vo- 
cabulary score may, therefore, be regarded as an index of the affected 
person's previous intellectual level. His performance on other subtests is, 
therefore, compared with the performance of normal persons of the same 
vocabulary level, on the principle that in a normal population there is a 
high correlation between vocabulary scores and those of other mental 
functions. 

The Babcock test provides a measure of mental "efficiency." This index 
is based upon tests emphasizing information, recall of meaningful
materials (nonrote), rote memory (meaningful and meaningless materials),
motor speed (rate of writing, tracing, digit symbol substitution), and
simple learning (immediate reproduction of paragraph, paired asso- 
ciates, drawing designs from memory). The subtests are arranged into 
several groups; the score for each group is the average of the subtests that 
constitute it. The “Efficiency Score" is the total of the group averages. The 
subject’s “vocabulary age” is based upon the score he obtains on the 
vocabulary test of the 1937 Stanford-Binet. 

An individual's rating is derived in this way. (1) The vocabulary score
is converted into a "vocabulary age." (2) The expected average level of
performance on the other subtests, corresponding to this vocabulary age,
is found in the given table of norms; this value is called the expected
average. (3) The actual scores are averaged; this average, called the total
efficiency score, is the obtained average. (4) The expected average is
subtracted from the obtained average, yielding an efficiency index, either
positive or negative.
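
The four steps above amount to a small calculation, sketched here under stated assumptions. The norm table is entirely hypothetical (it is not Babcock's table), the conversion from vocabulary score to vocabulary age in step (1) is taken as already done, and the obtained average is computed as the mean of the group averages, which is one reading of step (3).

```python
from statistics import mean

# HYPOTHETICAL norms: vocabulary age -> expected average on the other subtests.
# These numbers are placeholders for illustration, not published values.
EXPECTED_AVERAGE_BY_VOCAB_AGE = {12: 11.0, 14: 13.0, 16: 15.0}


def efficiency_index(vocab_age, subtest_groups, norms=EXPECTED_AVERAGE_BY_VOCAB_AGE):
    """Return obtained average minus expected average (may be negative).

    subtest_groups maps a group name to the list of subtest scores in it.
    Each group is scored as the average of its subtests (as described in the
    text); the obtained average is then taken as the mean of those group
    averages, which is an assumption of this sketch.
    """
    group_averages = [mean(scores) for scores in subtest_groups.values()]
    obtained_average = mean(group_averages)
    expected_average = norms[vocab_age]  # step (2): look up the norm
    return obtained_average - expected_average  # step (4)


# Illustrative (invented) record: group averages are 11, 10, and 9.
example_groups = {
    "memory": [10, 12],
    "motor speed": [9, 11],
    "learning": [8, 10],
}
example_index = efficiency_index(14, example_groups)
```

With these invented figures the obtained average is 10 against an expected 13, so the index is negative, indicating performance below what the vocabulary level would predict.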
The efficiency index has been used as a diagnostic device. For instance,
in a group of paretics, an average index of -4.8 was found. The size of
the index corresponded in general to the degree of mental deterioration.
A group of schizophrenics had a median index of -3.5. Results obtained
by other investigators agree, in the general trend, with those reported by
Babcock for abnormal groups.
Efficiency indexes, presumably characteristic of the several groups, are
averages. Inspection of the Babcock and Levy table of norms shows that
while the differences between median indexes of the several clinical
groups may be reliable, there is still a large amount of overlapping of
scores between any two groups. This means, of course, that the efficiency
index is not in itself adequate for diagnosis and classification, but must
be used as one of several converging lines of evidence.
Some clinical investigators have gone beyond the use of the efficiency
index alone. They have sought patterns of impairment and performance,
the principle being the same as in the case of the Wechsler scale.
Schizophrenics, for example, perform worst on tests of immediate and delayed
recall of meaningful materials, their responses being characterized not
only by a small number of correct recalls but also by serious distortions
and introduction of bizarre material (51, p. 379 ff). However, on rote
repetition of digits schizophrenics do relatively well. This difference
between the two types of recall is regarded as significant. Since
schizophrenics maintain their efficiency on some types of tests, but are markedly
inefficient on others, total scores will suggest only an intermediate degree 
of impairment, which is inconclusive and can be misleading. The impor- 
tance of subtest analysis, both quantitative and qualitative, is thereby 
illustrated. 

This instrument is intended to measure mental functioning regarded
as "crucial in the maintenance of stability in our present social
environment." Babcock and Levy, therefore, have attempted to isolate aspects of
intelligence that remain stable under most mental disorders and that
measure level of "abstract ability" (in actuality, vocabulary). Other
mental processes, however, are impaired during mental disorders. Therefore
the tests are designed to measure impairment of efficiency in the
following functions: speed of perception and response in simple well-learned
experiences, and in the less familiar and more difficult; rate of learning;
psychomotor responses; memory in several stages and aspects; and
smoothness in sustained effort.

Not all psychologists agree that these tests measure the functions named.
For example, some maintain that the tests measure, among others,
attention, concentration, coherence of thought processes, and visual imagery.
Nevertheless, regardless of the final determination of the precise functions
measured, these tests have demonstrated their usefulness for clinical
purposes when the numerical index is supplemented by an analysis of specific
impairment and a qualitative statement concerning noteworthy aspects
of the subject's performance.

Tests of Concept Formation. These tests are based upon the
principle that emotional disturbances and personality disorders interfere
with thinking processes, particularly with ability to form abstract
concepts. The purposes of these tests are, therefore, to help the psychologist
observe the subject's thought processes and to discover the extent to which
maladjustment or mental illness has impaired his conscious thinking, as
revealed in efforts to solve problems requiring the formation of concepts.

In particular, these tests are intended to evaluate the subject's ability
to deal with objects and situations on the abstract or conceptual level, as
compared with the concrete. Ability to form concepts implies conscious
reasoning at the abstract level; that is, transcending the immediate specific
sensory situation, abstracting the common property from particular
instances, analyzing and synthesizing, shifting from one aspect to another,
keeping in mind several aspects simultaneously, planning ideationally,
and self-criticism. An individual's behavior at the concrete level, on the
other hand, lacks these characteristics. The individual is then unreflective;
he responds to the immediately given object or situation as something
unique; he does not perceive an object or situation as one instance of a
general class or category.


The Goldstein tests are of several kinds. A Cube Test (devised in
collaboration with M. Scheerer) estimates the subject's ability to copy
colored designs, using variously colored cubes. The twelve designs are,
with one exception, those of the familiar Kohs series. It is maintained that
reproduction of designs may be achieved through either a concrete or an
abstract approach. In the former, the subject merely perceives the model
as a whole and attempts, without analysis or reflection, to copy it. In the
latter approach, the subject perceives the design reflectively and attempts
to reproduce it while employing deliberate analytical reasoning directed
toward a discovery of the relations of the parts of the designs (presumably,
the principles underlying its construction). If the subject fails to reproduce
the design correctly on the first attempt, the examiner presents the same
design in a graded series of modified forms, each of which is less difficult
to apprehend than the preceding form. Each step is intended to facilitate
the solution through: (1) enlargement of the model to actual block
size; (2) emphasis upon delineation of part relations; or (3) use of actual
block models.

Fig. 14.6. From the Goldstein-Scheerer Cube Test. The subject is
required to construct this pattern from a set of blocks, variously
colored. The Psychological Corporation, by permission.

Abnormal subjects, it was found, did not benefit from the presentation
of graded concrete aids, whereas normal subjects did. Those in one group
were unable to learn; the other group did learn to solve the problem
at the level of abstraction. For the normal group, the aids are said to be
a means of learning to succeed at the abstract level on later designs. For
abnormal persons, the aids are concrete presentations without possibility
of transfer value to subsequent situations. If the subject being tested
cannot benefit from the modified and simplified aids, impairment of
abstract behavior is indicated.

A Color-Sorting Test (A. Gelb, collaborator) is also used to estimate 
levels of abstract and concrete behavior. This requires the use of color 
concepts, if the response is at the level of abstraction. In one test, woolen 
skeins of different hue and tint are presented at random, and there are 
twelve different shades of each color hue. The subject selects one and is 
then asked to pick out all others that go with it. In the second test, three 
skeins are presented; two are of the same hue but different in brightness 
and saturation; the third is of a different hue, but the same in brightness 
as one of the first pair. The subject is expected to make a selection either 
according to hue or brightness. 

Gelb and Goldstein report that abnormal persons with functional be- 
havior disturbances are incapable of using abstraction. In varying degrees, 
they appear to be unable to shift from concreteness of matching to the 
abstract thinking required in selecting and classifying. 

The third in this group is an Object-Sorting Test (A. Gelb, E. Weigl,
M. Scheerer, collaborators). This consists of a group of about thirty
objects common to everyday experiences, one group for males and one for
females. Its purpose is "to determine whether the subject is able to sort
a variety of simultaneously presented objects according to general
concepts." Ability so to classify is evidence of the abstract approach; inability
is evidence of the concrete approach. Classifications may be made on the
[...] color, form, materials, situational membership (implements for
setting a dinner table), and pairings.

A Color Form Sorting Test (Scheerer and Weigl, collaborators) [...]
same principle as that in the others of this group. [...] an inability
[...] meaning of sorting; dep[...] shift voluntarily from [...]


A Stick Test completes the series (in collaboration with Scheerer). This
is intended to examine the subject's ability to: (1) copy relatively simple
geometric figures composed of sticks that are 3.5 and 5.5 inches in length; 
and (2) reproduce these same figures from memory, after exposure of five 
to thirty seconds. The sequence in which the thirty-four figures are pre- 
sented represents something of a scale in terms of number of sticks in- 
volved and increasing complexity of the model. At the level of abstraction, 
the subject is expected to survey the pattern and analyze it into its parts 
and part relationships (spatial and directional). Reference to the stimulus 
figure as being associated with, or representative of, an object in actual 
experience is regarded as a concrete response. Thus, to reproduce the 
figure ∧ and call it a "roof" is to perceive at the concrete level. It is
reported that this test is particularly suited to cases of marked mental
defect or deterioration.

The authors of this series of tests do not present the type of statistical
information upon which evaluations are usually based. In fact, there has
been no attempt to standardize their tests with respect to the usual
criteria of validity and reliability; nor are scoring standards provided.
Their emphasis is placed entirely upon qualitative evaluation of the
subject's responses as an aid in the diagnosis of mental impairment or
of mental arrest. There is no doubt, however, that the clinical value of the
tests would be enhanced if quantified or scaled ratings could be had, not
necessarily to obtain percentile or standard scores or the like, but to
facilitate an over-all estimate of an individual's responses and to increase
the objectivity of the tests' interpretation. At the present time, these
devices appear to be valuable in the hands of an experienced psychologist
for detecting cases of marked intellectual impairment or arrest, by
presenting tasks involving abstraction, tasks which are rather simple for
persons of average ability who are functioning normally. In fact, the difficulty
levels of some parts of this series are such that most 7-, 8-, or 9-year-old
children can achieve a solution at the abstract level. Hence, in most
instances, the failure of an adult readily to offer a solution at the abstract
level may well be regarded as a significant symptom of serious loss of
intellectual level and efficiency.


The Hanfmann-Kasanin test 8 consists of twenty-two blocks, each block
in one of five colors, six shapes, two heights, and two widths. The
problem for the subject is to discern how the blocks may be divided into four
categories: tall-wide, flat-wide, tall-narrow, and flat-narrow figures.
The subject is shown the entire set randomly arranged. The examiner
selects one as a sample and directs the examinee to pick out all others
that are of the same kind. It is obvious that the subject might at first
make his selection according to color, shape, or size. Each type of block
has a nonsense name on the bottom, concealed from the [...] (
[...] flat-wide). After each grouping, the examiner sho[...]
of the wrongly selected forms by revealing that it [...]. The procedure
is to continue this kind of aid until the subject discovers the
predetermined classification, if it is at all possible for him to do so.

8 This test is a modification of the Vigotsky tests. See E. O. Miller (44).


Fig. 14.7. Hanfmann-Kasanin test blocks, C. H. Stoelting Company,
by permission.


Performance is anal[...] task, nature of the att[...] [...]tion. In each
of these [...]; and the conceptual, these being scored [...] and 3,
respectively. The scoring method is based upon the nature of the
subject's approach to the [...] ability to verbalize his pe[...]

[...] Hanfmann and Kasanin were concerned principally with [...]
schizophrenics and their clinical differentiation. Rapaport
(51, Chapter 4) found their test clinically useful in obtaining evidence of
the subject's modes of thinking and response to difficult and frustrating
problem situations, rather than for diagnostic differentiation. For
example, different individuals may show the following modes of response
in varying combinations and degrees: fluidity (lack of direction);
flexibility (varying the approach, but keeping the end in view); rigidity
(resistance to modification of behavior); persistence (continuity of behavior).
These qualitative descriptions, therefore, obtained with the Hanfmann-
Kasanin test, are of the kind that are valuable as a supplement to
numerical ratings in obtaining a more nearly complete description of a
person's mental functioning.
Although the authors themselves used a scoring plan, the weights of
the assigned values have not been experimentally determined. They are,
rather, arbitrarily assigned scale values; they may, therefore, be more
properly designated as numerical indicators. As such, they are useful in
obtaining an over-all evaluation of an individual's performance on the
test.

The problem situations presented in the test are at a more difficult and
higher level of abstraction than those in the Goldstein series. Since these
problems make greater demands upon persons of higher mental levels,
they may reveal deterioration in conceptual thinking that would still be
unapparent in the subject's long-established responses for meeting
familiar situations and for dealing with familiar problems. The Hanfmann-
Kasanin test provides a means of observing behavior in a controlled
situation and of obtaining information of some significance, to be added to
other psychological data.


The Hunt-Minnesota Test for Brain Damage was devised as an aid 
in detecting organic brain damage. It is intended for use with individuals 
16 years of age or older. The instrument consists of these three major divi- 
sions: the Vocabulary test of the 1937 Stanford-Binet, which has been 
found relatively insensitive to brain damage; six memory and recall tests, 
considered to be sensitive to brain damage; and nine interpolated tests
that serve as "validity indicators."
The six deterioration tests involve memorizing and retention of paired
designs and of paired words presented orally. Both types are used to test
immediate and delayed recall. A series of paired designs is exposed,
without interruption, for six seconds each, after which the subject is shown
one of each pair, in sequence, and is required to identify the design
associated with it (immediate recall). In the word test, a series of ten pairs of
words is read; after this, the first word of each pair is given and it
is the subject's task to name its paired word (immediate recall).
[...] of designs and three of words.

The vocabulary test, as in the case of other clinical [...], is
regarded as a measure of a relatively stable function that holds up against
deterioration. The vocabulary score, which is the number of words
correctly defined according to the Stanford-Binet standards, is taken as the
base for the determination of the deterioration score, from which presence
or absence of brain damage is inferred.

The interpolated tests consist of the following: information; [...]
the months of the year; counting from 1 to 20; counting from [...]
by 3s; tapping on the table every time the number 9 is read in a
series of digits (attention test); counting backwards from 25 [...];
repeating digits backwards; naming the months in reverse order; and
subtraction of 3s from 79 to 1. These items are included as "validity
indicators," since persons unable to perform them are too uncooperative, or
too disturbed, or too deteriorated to be tested; hence testing them will
not yield valid results. Critical scores are provided for each of the
interpolated tests, these having been reached, or exceeded, by 90 percent of the
brain-damaged persons used in standardizing the test. The test's author
reports that individuals whose scores fall below the critical levels in
three or more of the interpolated items cannot be validly tested.
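
The screening rule just described (a record is set aside when three or more interpolated tests fall below their critical scores) can be stated as a short check. This is only a sketch: the test names are shorthand invented here, and the critical scores and the example record are hypothetical placeholders, not values from the test manual.

```python
# HYPOTHETICAL critical scores for the nine interpolated tests (placeholders).
CRITICAL_SCORES = {
    "information": 4,
    "months_forward": 1,
    "count_1_to_20": 1,
    "count_by_3s": 1,
    "attention_tapping": 5,
    "count_back_from_25": 1,
    "digits_backward": 3,
    "months_reversed": 1,
    "serial_subtraction_3s": 1,
}


def below_critical(scores, critical=CRITICAL_SCORES):
    """Return the names of interpolated tests scored below their critical level."""
    return sorted(name for name, cutoff in critical.items() if scores[name] < cutoff)


def validly_testable(scores, critical=CRITICAL_SCORES, limit=3):
    """False when `limit` or more interpolated tests fall below critical level."""
    return len(below_critical(scores, critical)) < limit


# Invented record: exactly at the critical level everywhere, then lowered on three tests.
at_cutoff = dict(CRITICAL_SCORES)
three_failures = dict(at_cutoff, information=2, digits_backward=1, attention_tapping=0)
```

A score exactly at the critical level counts as reaching it, in keeping with the wording "reached, or exceeded."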


[...] in several state hospitals upon a small number of patients (only
thirty-three) [...] damage. [...] with other evidence in each case.

The correlation between deterioration scores and vocabulary scores was
found to be -.51; between age and vocabulary, .07; between age and
deterioration score, -.37. Multiple correlation of deterioration score with
age and vocabulary was -.65.
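
The multiple correlation quoted above can be checked against the three pairwise coefficients with the standard two-predictor formula, R² = (r₁² + r₂² − 2·r₁·r₂·r₁₂) / (1 − r₁₂²). The sketch below is only a consistency check on the printed figures; a multiple R is nonnegative by definition, so only its magnitude is compared. The function name is illustrative.

```python
from math import sqrt


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of y with two predictors, from pairwise correlations.

    r_y1, r_y2: correlations of the criterion with each predictor.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return sqrt(r_squared)


# Deterioration score with vocabulary (-.51) and with age, vocabulary-age r = .07.
r_if_age_negative = multiple_r(-0.51, -0.37, 0.07)  # age coefficient as printed
r_if_age_positive = multiple_r(-0.51, 0.37, 0.07)   # same magnitude, opposite sign
```

Under these figures the magnitude of about .65 is reproduced when the age and deterioration coefficient is entered as +.37, while the printed sign gives about .61. This may mean that a sign was altered in printing, though nothing in the text confirms it.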

This instrument, like all others in this difficult area of psychological
testing, needs to be more clearly validated by further application and
experimentation. It is well, therefore, to view it at present (as suggested by
its author) as additional evidence in reaching a conclusion as to the
presence or absence of brain damage. At the same time, the more extreme
deterioration scores have greater diagnostic value. Although few relatively
recent research studies have been published on this test, available data
indicate that it is clinically useful, if its results are carefully analyzed,
qualitatively as well as quantitatively (65).

In addition to the instruments briefly described above, other more
recent devices have been made available; but these are in rather early
experimental stages. One of these is the Grassi Block Substitution Test,
which requires the subject to construct patterns, the designs of which are
presented to him (20). Other experiments have been made with sorting
of words, cards (14), and geometric forms (66). These newer experimental
efforts, however, utilize essentially the same basic psychological principles
as those employed by their predecessors.

Evaluation. Psychological tests in this area are based upon the
principle that old, well-established habits and modes of behaving (such as
word knowledge) show relatively little loss, whereas new learning, newly
acquired associations, performing new tasks, and solving new types of
problems are impaired in cases of brain damage and other forms of
mental disturbance.

As a group, these tests have not been adequately standardized, so far
as norms, scoring, and interpretation are concerned. Standardizing tests
for extremely deviant and often uncooperative populations is, however,
an extraordinarily difficult task. Since these tests are intended primarily
for adults, their standardization and interpretation are further
complicated by the fact that allowance should be made for normally expected
loss at advanced ages; and individual differences in cultural, educational,
and occupational backgrounds must be considered in evaluating a
subject's performance. Validation is made the more difficult, too, since
psychiatric diagnoses and classifications are generally used as criteria
of validity; and these are themselves sufficiently inconsistent and lacking
in reliability as to introduce an important source of error in the
validation.

These tests do not, as yet, provide a self-sufficient method of measuring
mental deterioration, except in the more marked cases. However,
clinicians who have used them are in substantial agreement that they are
valuable in providing opportunities for observation of mental operations
under controlled conditions in which prescribed materials are used. In
such situations, an experienced psychologist is able to make a number of
qualitative observations, in addition to deriving, at times, quantitative
values for the subject's performance. The particular qualitative
observations that can be made will depend upon the content and technique of
the test. Qualitative observations might include descriptions of the
subject's thought processes, estimates of levels of abstraction or concreteness,
bizarre responses, degree of self-criticism, fluctuations of attention,
degrees of rigidity or flexibility of thought processes, level of immediate and
delayed recall compared with recall of remotely learned materials, and
evaluations of the subject's performance in the light of his former
educational and occupational status. To make these observations, of course,
requires a background of experience with a sufficiently large and varied
number of subjects, including, for comparative purposes, persons within
the normal range of behavioral adjustment and performance.9
NONVERBAL GROUP SCALES OF
MENTAL ABILITY 


Beginnings 


[...] hence they are called individual scales. [...] performance scales
already described. Individu[...] [...]consuming and require that the
examiner be [...]ring them, [...]duction, and by a [...] psychologists
under [...]

World War I [...] group test. Prior to 1917, psychologists had been
experimenting with test items and organiza[...] examination. Shortly after
the United States [...] a psychological branch was formed in the [...]
group scales for the genera[...]. [Although ...] and the Yerkes Point
Scale were employed to some extent, as well as an individual performance
scale, the main task in the army was one of testing large numbers of men
in a short space of time. Consequently, the Army Alpha scale (verbal) and
the Army Beta scale (nonverbal), both group scales, were organized. These
were actually the product of the contributions of individual psychologists,
notably Arthur S. Otis, who pooled their experience, experimental
results, and resources.
About 1,750,000 men were tested in the army of World War I. The
scales were not highly satisfactory instruments; then, too, the men were
often examined under unfavorable conditions. However, the results were
of some assistance in the selection of men for advanced or special training,
on the one hand, and of men of such inferior ability as to be unsuited
for military training, on the other.

The use of psychological testing in World War I had many outgrowths, 
some of which were unforeseen by the psychologists themselves. The data 
were reported and analyzed in a huge volume (31). On the basis of these 
data, many periodical articles and books appeared on such subjects as 
racial and national differences in intelligence, geographic differences in 
intelligence within the United States, differences between occupational 
groups, relationship between educational status and intelligence, and the 
general intellectual level of the American adult. Not only were many of 
these data of doubtful validity, but some of the interpretations and pub- 
lications based upon them gave rise to serious misapprehensions in regard 
to these problems, which are loaded with social and educational
implications. Another result of psychological testing in the army was the impetus
it gave to the development of group tests for civilian purposes, notably 
in educational work at all levels, from kindergarten through university. 
Also, it set a precedent, for in World War II psychological testing was 
conducted on a vast scale in all departments of the armed forces. 

The types of test materials included in the army group scales and 
in the numerous group scales subsequently developed were not all inno- 
vations. For example, tests of memory, sentence completion, free and 
controlled word association, arithmetical computation, vocabulary,
classification of objects, and following directions had been in process of
experimentation in the United States and Europe during the last quarter
of the nineteenth century.


Characteristics of Group Tests of Mental Ability 

Most group tests, implicitly or explicitly, are constructed on
the principle that intelligence is a general capacity and that it should
be measured by sampling a variety of mental activities. Inspection of the
scales shows, therefore, that they include, in various combinations, such
items as following directions, arithmetical problems, [...]
(in connection with "common-sense problems"), [...]
arranged sentences, completion of number series, [...]
verbal analogies, information, mazes, three-dimensional [...],
counting of cubes, digit-symbol combinations, picture absurdities, [...]
arrangement, geometrical construction ("paper [...]"), [...]
[...]metric pattern analogies. Samples of each of these will be presented
when some scales are described.

In most group scales, the items of each type (for example, number 
series) are placed together in separate subtests or parts, beginning with 
the easiest and progressing by intervals (as nearly equal as can be 
achieved) to the most difficult. The principle involved here is [...]. 
By means of such an arrangement of items, every individual for whom 
the test is intended should be able to get some items correct and progress 
to his level of maximum difficulty in that type of mental activity. 

It will be found, however, that items in a scale are arranged, at times, 
in "spiral omnibus" fashion; that is, items of various types are presented 
in regular or irregular order, instead of being grouped separately in 
subtests. Thus, there may be a sequence of this kind: one item each of 
number completion, arithmetical problem, vocabulary, information, 
analogies, etc.; then the same kinds of items, increasing in difficulty, 
repeated in the same or in a different order. 
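
The "spiral omnibus" ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function name, the item labels, and the numeric difficulty values are hypothetical and do not come from any published scale. Each turn of the spiral takes one item of every type, and successive turns move to harder items.

```python
from itertools import zip_longest


def spiral_omnibus(items_by_type):
    """Interleave items of several types, easiest items first.

    items_by_type maps a (hypothetical) item-type name to a list of
    (difficulty, label) pairs; a lower difficulty means an easier item.
    """
    # Within each type, order the items from easiest to most difficult.
    ordered = [sorted(items) for items in items_by_type.values()]

    sequence = []
    # Each tuple from zip_longest is one turn of the spiral: one item of
    # every type at the same difficulty rank. Types that have run out of
    # items contribute None and are skipped.
    for cycle in zip_longest(*ordered):
        for entry in cycle:
            if entry is not None:
                sequence.append(entry[1])
    return sequence


if __name__ == "__main__":
    example = {
        "number completion": [(2, "N2"), (1, "N1")],
        "vocabulary": [(1, "V1"), (2, "V2")],
        "analogies": [(1, "A1")],
    }
    print(spiral_omnibus(example))
```

A scale built on separate subtests would instead present all the items of one type before moving to the next type.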


Every group scale is standardized for a specified range of ages or school 
grades. Thus, the particular types of items used and the levels of 
difficulty within a scale will depend upon the group for which the scale 
is intended. For example, a group scale designed for children from 
kindergarten through the second grade will be almost entirely nonverbal in 
character, except for directions; one designed for pupils in the 
intermediate grades will include an increasing portion of abstract and 
conceptual items (verbal and numerical); and tests of intelligence for high 
school pupils and college freshmen are largely, sometimes entirely, of the 
verbal and numerical kind. 


On many group scales, an i[...] of the num[...] range of point scores. 
[...] the deviation IQ. 

Group scales are scored more rigidly and more objectively than those 
individually administered, such as the Stanford-Binet.¹ In the former, the 
correct response or responses are supplied for each item, so that they can 
be scored by clerks or machines. In the case of the Stanford-Binet and 
similar scales, although specimens of satisfactory and unsatisfactory re- 
sponses are supplied, it is frequently necessary for the examiner himself 
to evaluate some and decide whether or not credit should be given. This 
necessary exercise of judgment, however, does not invalidate the scales; 
correlational studies have shown that there is close agreement between 
experienced examiners as to the scoring of given responses. 

Most group scales impose time limits for each of the several subtests, 
or parts. Whether this fact makes a scale a test of speed of response, 
solely or largely, or whether the scale measures “power” (level of difficulty 
the individual is capable of reaching) is a question to which answers have 
been provided by experiment. The imposition of time limits does not 
necessarily make a scale a test of speed of performance; the significance 
of the speed factor, in affecting the total score of a person, varies with 
the scale used. 

Some group scales are entirely nonverbal in content; others are entirely 
verbal; still others combine the two types of items. In this chapter we 
shall describe several representative scales of the nonverbal variety. 


Tests for Kindergarten and Primary Levels 


Most group scales are devised for use within a limited range of ages or 
school grades. A few of these instruments are intended for use from the 
kindergarten level through the second or third grade. It is possible to 
give a group test at these levels if the number is small enough (a maximum 
of twenty) to permit close observation by the tester and their teacher, to 
be assured that the children are paying attention, following instructions, 
and not having accidents that might affect their scores adversely. Except 
for instructions, the tests at these age levels are entirely nonverbal in 
some scales; in others a relatively small amount of verbal and number-test 
materials is introduced at the third grade level (for example, 
Kuhlmann-Anderson tests). At the early levels, especially kindergarten and 
first grade, it is important that the child's responses should not require 
skill in writing or using a pencil. Responses, therefore, are made by a 
mark (a circle, straight line, or an X). 

¹ This does not mean that the group scales are superior instruments. In 
fact, inflexible scoring is a disadvantage if the examiner's purpose is to 
study an individual clinically and to analyze test results qualitatively as 
well as quantitatively. 

The Pintner-Cunningham Primary Test (1923-1946) is one of the early 
and well-known scales in this group.² It will be [...] that the contents 
of all tests at this level are similar in form and [...] items, although 
they differ in details. It includes the following seven subtests. 


Identification of [...] 
[...] of objects that belong together (for example, lock and key, star 
and quarter-moon). 
Discrimination of size: matching items of clothing with a pictured girl. 
Perception of the parts of a whole: analyzing a series of pictures of 
increasing complexity. A "stimulus picture" is given; adjacent to it are a 
number of parts of that picture. The task is to mark those parts that 
constitute the whole. 
Picture completi[on ...] and marking, [...] 
Copying designs [...] 

Fig. 15.1. Items from the Pintner-Cunningham Primary Test (panels: 
Picture Completion; Dot Drawing). Copyright 1946 by Harcourt, Brace & 
World, Inc. Reproduced by permission. 

The Chicago Nonverbal Examination (1936-1947) is another early and [...] 
[...] the Pintner-Cunningham, it was de[...] [...] types, though familiar 
in general psychological testing, are not included in the other scales at 
the primary level, described in the summary that follows. 

Digit-symbol: the familiar test. 

Perception of similarities and differences of objects: crossing out the one 
object that is of a different class from the others in the group. 

Three-dimensional visualization: counting the number of blocks in a 
picture. 

Paper formboard: marking the parts that constitute a whole geometric 
figure. 

Visual perception of detail: matching geometric designs that have more 
or less internal detail. 

Picture arrangement: numbering parts to indicate how they must be 
placed to form a whole. 

Logical sequence: numbering pictures in their correct order to represent 
a sequence of events. 

Picture absurdities: noting missing or superfluous parts. 

Picture matching: relating a part to a whole picture. 

Digit symbol: a second subtest of this type, but more complex than the 
first. 

Fig. 15.2. Items from the Chicago Nonverbal Examination (panels: Block 
Counting; "Paper Form Board"; Matching Figures; Picture Sequence; Picture 
Arrangement). The Psychological Corporation, by permission. 


[...] Their subtests are described briefly below to illustrate 
similarities and differences in content. 

California (grades 1, 2, 3): 

Memory: immediate and delayed, [...] 
Spatial relationships: sensing ri[...] 
[...] category in pictures, [...] 
Number concepts: counting dots in squares and writing the number. 
Digit-symbols: the familiar task of writing a digit in various simple 
[...] 
[...] letters of the word, [...] 
Substitution: substituting letters of the alphabet for series of digits, 
each of which will spell out a word (Fig. 15.3). 

Fig. 15.3. From the Kuhlmann-Anderson Intelligence Tests. By permission. 

Perceptual analysis of geometric patterns: completing each partially 
drawn figure to match the other member of the pair. 

Word classification: eliminating the one word that does not belong in a 
series of five, the other four of which are in a single category (for 
example, toys, food, plants). 


Lorge-Thorndike (grades 2 and 3): 

Identification: pictured objects, animals, and humans are identified by 
name (read out by the tester). 

Classification: categories of pictured objects, animals, and humans are to 
be perceived, the irrelevant one being marked out. 

Associated objects: two of five drawings in each row are to be marked, 
being related according to a common characteristic (for example, musical 
instruments). 

The SRA Tests of General Ability consist of a graded series of scales, 
each intended for three grade levels from kindergarten through the 
twelfth. The scale for kindergarten to grade 2, and the one for grades 2 
to 4, have two parts, identical in form and type of materials, but differing 
in content and difficulty. The two subtests are the following: 


Identification of familiar objects: placing an X on one of five items, as 
directed by the examiner (for example, "Find the thing that is used to hit a 
ball."). 

Perception of similarities: identifying one drawing in five that does not 
"follow the rule" of that particular row. 

Evaluation. As in the case of most carefully constructed group tests, 
those described have quite satisfactory reliability. The 
Pintner-Cunningham coefficients vary from .88 to .94; for the Chicago 
Nonverbal, they range between .80 and .90. The California test manual 
reports reliabilities at .90 or above. The Kuhlmann-Anderson reliability 
coefficient at the third-grade level is .91; for the Lorge-Thorndike, at 
grades 3 and [...], the reliabilities are .90 and .86 (retest with 
alternate forms).³ The median of the reliabilities of the Tests of General 
Ability is .87. On the whole, therefore, it may be said that these scales 
have demonstrated their reliability at an acceptable level. In examining 
the soundness of a scale, however, one should look, also, for data on 
standard errors of the scores. The Kuhlmann-Anderson, for example, reports 
a mean IQ difference of less than 2 points between the first and second 
testing; but the [...] of the differences is 9 points. Flanagan reports a 
standard error of about 7 IQ points "for IQs around 100," for the Tests of 
General Ability. 
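
A reliability coefficient and a standard error are linked in classical test theory by the formula SEM = SD x sqrt(1 - r). The sketch below applies that textbook formula; it is not necessarily the procedure behind the figures reported in the manuals, and the standard deviation of 16 IQ points is an assumption chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test-theory estimate: SEM = SD * sqrt(1 - r)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


if __name__ == "__main__":
    ASSUMED_SD = 16.0  # assumption for illustration, not taken from the manuals
    for r in (0.86, 0.87, 0.91, 0.94):
        sem = standard_error_of_measurement(ASSUMED_SD, r)
        print(f"reliability {r:.2f} -> SEM of about {sem:.1f} IQ points")
```

Under these assumptions a reliability of .91 corresponds to a standard error of about 4.8 points, and one of .87 to about 5.8 points, which shows why even a reliability in the high .80s leaves appreciable uncertainty in an individual score.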

In studying their validity, the several authors have used the familiar 
criteria: correlation with the Stanford-Binet and with measures of school 
achievement [...] by known groups (especially differentiation of the 
mentally retarded [...]). [...] Examination provide a case in point. 
This scale found a mean IQ of 61 (SD = 12) for ninety-nine mentally 
[...], as compared with a mean of 62 (SD = 6) [...] difference between 
the Stanford-Binet [...] given). Thus, although the [...] standard 
deviations and the mean [...] discrepancies are probably attributable to 
the differences in content of the scales; and they suggest, further, that 
the two scales should be used to supplement each other, rather than as 
substitutes, when an individual's mental abilities are being analyzed and 
evaluated. 

³ [...] that an adaptation of the [...] using the odd-even split-half 
method [...] 

The SRA Tests of General Ability and the Lorge-Thorndike tests are 
based in part upon the concept of construct validity (see Chapter 5), 
although the manuals provide data on some of the other validating 
criteria as well. The manual of the former states: “The validity of the 
TOGA series rests primarily on its definition of intelligence as basically 
involving verbal and reasoning abilities, and its emphasis on test ma- 
terials that do not require school-learned skills.” The items do not re- 
quire skill in reading, arithmetic, or other specific skills. They do depend, 
however, upon word knowledge and upon familiarity with objects com- 
monly experienced. The Lorge-Thorndike manual states that the aspects 
of intelligence to be measured are: ability to deal with abstract and gen- 
eral concepts; interpretation and use of symbols; relationship among 
concepts and symbols; flexibility in organizing concepts and symbols; 
using experience in new patterns; power rather than speed (see Chapter 7; 
“Definitions and Analyses of Intelligence”). 

All these scales, especially the more recent ones, meet the technical 
specifications in regard to population sampling, as regards both numbers 
and representativeness. The one type of adequate evidence lacking is 
that pertaining to their validity in predicting future school performance. 
However, since they do correlate well with current school performance,* 
we may infer that these scales have reasonably satisfactory predictive 
value, inasmuch as quality of school achievement in one grade correlates 
well with quality in later grades. Correlation between Lorge-Thorndike 
IQs and Stanford Reading Grade Equivalents was .87; with Stanford 
Arithmetic Grade Equivalents, it was .76. Flanagan reports correlations 
from .52 to .72, with a median of .60, with the Science Research Asso- 
ciates Achievement Series; and .74 to .81, with a median of .78, with 
the Iowa Tests of Educational Development. 


"Culture-Fair" Tests 


The nonverbal tests of intelligence thus far described are in- 
tended for use with young children in the United States who have had 
the advantages, at least, of developing in an ordinary environment. Some 
psychologists and educators have maintained, however, that any test which 
depends upon even as little word knowledge as those already described 
is not "culture-fair" to all children. And some have maintained, also, 
that tests completely free of demands upon language would be desirable 
for testing those segments of the adolescent and adult population who 
[...] 

* The newer term for this form of validity is "concurrent validity." 

[...] present. The child has to select, from three statements, the one that presents 
the most probable explanation of what is happening in the picture. 

Picture analogy: This is the familiar type of item in which two related 
objects are shown. The child is required to select a similar relationship in 
a given set of pictures. 

Money: In each item, two sets of coins are shown in three different 
combinations, each of which is incomplete. The problem is to discern which 
of the three combinations will yield a stated sum. 


As is the case with many tests, reliability is often more easily established 
than is validity. Split-half reliability coefficients were from .81 to .84 at 
all grade levels, except the first, where it was .68. This last is too low 
for predictive purposes. The remaining indexes are moderately high, but 
they are not as large as optimally desirable. Test-retest reliabilities, after 
two weeks, were .72 (grade 2) and .90 (grade 4). 
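
The split-half coefficients quoted above are commonly obtained by correlating scores on the odd-numbered items with scores on the even-numbered items and then stepping the result up to full test length with the Spearman-Brown formula, r_full = 2r / (1 + r). The following sketch shows that general procedure; whether the authors used exactly this variant is an assumption, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2.0 * r_half / (1.0 + r_half)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores (1 = right, 0 = wrong)
    per examinee. Returns (half-test r, Spearman-Brown corrected r)."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return r_half, spearman_brown(r_half)
```

Because each half contains only half the items, the uncorrected half-test correlation understates the reliability of the whole test; the correction is what makes split-half figures comparable with other reliability estimates.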

Although Davis and Eells do not attempt to establish validity of their 
tests upon correlations with earlier and more conventional scales, they 
present, for informational purposes, some correlations with the Otis 
Quick-Scoring Tests, obtained in grades 3 through 6. Of the sixteen 
coefficients, seven are in the .50s; the remainder are rather evenly dis- 
tributed from .39 (lowest) to .66 (highest). The authors believe these 
coefficients are what should be expected, since they indicate that the 
abilities measured by their tests bear a substantial relationship to those 
measured by the other tests, yet theirs and the others do not measure 
exactly the same factors. 

Correlations with standardized school-achievement tests are also re- 
ported. They are: reading, .43; arithmetic, .41; language, .40; spelling, 
.24. These coefficients are lower than those generally found with the 
more usual types of individual and group scales, but this is to be ex- 
pected because of the nature of the Davis-Eells tests. The authors’ posi- 
tion is that several significant factors, other than problem-solving ability, 
contribute to success in school achievement. 

Validity studies published subsequent to this test's appearance have 
shown quite consistently that it does not correlate as well with school 
achievement nor predict later school achievement as successfully as do 
the earlier and more usual types of test materials. These investigations 
include correlations with standardized achievement tests and with teach- 
ers' ratings of pupils’ abilities. Other data indicate that the Davis-Eells 
tends to rate children lower, and that it differentiates as much among 
high and low socioeconomic groups as do the verbal and other nonverbal 
tests, which were not regarded by Davis and Eells as being culture-fair. 

The Davis-Eells test is based upon the concepts of content and construct 
validity. The authors' objective was to devise an instrument that presents 
situations and problems in forms that are within the common experience of 
all children in the [...]. This meant the elimination of cultural factors, 
including language (except in the directions, given orally), that might 
favor one group and 
handicap another. Although nonverbal tests are not new, the kinds of 
situations presented, excepting picture analogies, are novel. Furthermore, 
the psychological functions to be sampled were determined and specified 
after interviews and psychological analysis, rather than by statistical 
methods. The functions specified are association, insight, reasoning, and 
organizational ability (method of attacking a problem). After the items 
for these tests were developed, each child in a group was interviewed to 
determine whether the problems evoked the mental processes that the 
tests seek to measure. It was found that 92 percent of the children who 
answered analogy problems correctly also explained the analogy relation- 
ship correctly. And nearly all pupils explained the relationships correctly 
in solving the other types of subtest problems. 

Although we may accept the authors’ view that this test measures the 
specified functions, research results have demonstrated the superiority of 
the long-established types of test items that employ verbal and numerical 
materials, and pictorial materials requiring concept formation. 

The Raven Progressive Matrices Tests (1938-1956), developed in Eng- 
land and widely used in the British armed forces during World War II, 
are nonverbal scales designed to evaluate the subject's ability to appre- 
hend relationships between geometric figures and designs, and to per- 
ceive the structure of the design in order to select the appropriate part 
(from several) for completion of each pattern or system of relations. The 
several scales, for use with persons 5 years of age and older, are regarded 
by the author as being a test of innate eductive ability; that is, a measure 
of the general factor (g). The tests are intended to evaluate the person’s 
ability to discern and utilize a logical relationship presented by 
nonverbal materials. The problems require, in varying degrees, analytical 
and integrating operations of the kind called "insight through visual 
survey." Verbalization and abstraction of relationships are also possible 
factors, if the subject is able to analyze and synthesize by these means. 
Factorial analysis suggests that the matrices tests are measures largely of 
a "general factor," with a small loading of a spatial-perception factor. 
Raven interprets the first of these factors as being essentially the same 
as Spearman's eduction of relations and eduction of correlates. 

The PM tests have been subjected to extensive research in several 
countries and with a wide variety of groups: children, adolescents, and 
adults, both normal and abnormal. Numerous reliability coefficients 
reported by Raven vary from the low .80s to the low .90s. Coefficients 
reported by other investigators, using the split-half method, ranged from 
.79 to .90. The differences in correlations are attributable to differences 
in the constitution of the group: age range, mean and range of ability, 
number in the sample, and socioeconomic and educational levels. These 
split-half coefficients are, on the whole, creditable under the 
circumstances. The test-retest reliability coefficients, however, are 
appreciably lower for scores of the youngest children (below age 7), 
although with older children and adults the test-retest coefficients vary 
within approximately the same range as those found by the split-half 
method. 

Fig. 15.7. Specimen items from the Raven Progressive Matrices Tests. By 
permission. 


Validity of the Progressive Matrices tests has been studied in a variety 
of the usual ways. When the Stanford-Binet was used as the criterion, 
correlations varied from .50 to .86. Correlations with the WISC ranged 
from the low .50s to .91. Most of the coefficients of correlation with 
these two widely used criteria were in the .60s and .70s. The tests corre- 
late as well with educational achievement as do many group tests (verbal 
and nonverbal), but not as highly as the Stanford-Binet and the WISC. 

The PM tests were correlated against verbal and other nonverbal group 
scales. Using the former, the resulting coefficients varied rather markedly: 
from .40 to .67. It is particularly interesting to note that the coefficients 
found between scores on the PM tests and those on other nonverbal scales 
(such as Columbia Mental Maturity, Pintner Nonlanguage, Porteus Maze, 
Chicago Nonverbal) are considerably lower (being in the .30s, .40s, 
and .50s) than those found when the S-B and the WISC were used as 
validating criteria. This fact signifies, among other things, that the 
Progressive Matrices items come closer to measuring the forms of abstract 
[...] intelligence tested by these two individual scales than [...] the 
functions involved in other nonverbal scales. [...] type of tests appears 
to be among the most promising [...]. Since it employs only one type of 
test material, [...] it should not be regarded as a substitute for the 
Stanford-Binet and the WISC, both of which provide the testee with greater 
opportunities for flexibility and versatility of mental activity, and the 
examiner with more sources of qualitative interpretation of responses and 
behavior. But the PM tests are of considerable value as a supplement to 
the S-B and the WISC and as a primary instrument for examining the deaf 
and others who labor under speech or language handicaps. 

The Pattern Perception Test, devised in England under the direction of 
L. S. Penrose, employs a single type of nonverbal material that be[...] 

Fig. 15.8. Items from the Penrose Pattern Perception Test. Galton 
Laboratory, University of London, by permission. 

Fig. 15.10. Items from the Cattell Culture-Free Test. The Psychological 
Corporation, by permission. 

[...] As is generally the case, their reliability coefficients are within 
a satisfactory range. But evidence of their validity is not so [...]. The 
Cattell [...] test correlates moderately with verbal scales, as does 
[...]; but in neither case are the correlations high enough to warrant 
using these nonverbal tests interchangeably with the verbal. The 
coefficients for the Cattell, for example, average in the .50s. 

[...] instrument for culture-free testing is the IPAT Culture Free 
Intelligence Test (1933-1958), available for three levels: Scale 1 [...] 
and for mentally deficient adults; Scale 2, for ages 8 to 12 and for 
unselected adults; Scale 3, for the range from high school pupils [...] 
adults. Actually only four of the eight subtests in [...] are regarded as 
culture-free, since the other four involve verbal [...] and information 
acquired in our culture. Three of the four [...] subtests are of the 
familiar type: series, classifications, and matrices. The fourth, called 
"conditions," is said to involve a novel type of "topological reasoning," 
although the functions it tests are the same [...] other nonverbal tests: 
visual analysis, reasoning and concept formation. The subtests are the 
following. 
Series: selecting one of several drawings to complete a series. 

Classification: crossing out the one irrelevant drawing in each row. 

Matrices: marking the one drawing, of several, that correctly fits the 
incomplete pattern. 

Conditions: inserting a dot into the appropriate one of several designs, 
the structure of which is consistent with the "conditions" given in the 
sample design. 

Here again, reliability coefficients are, on the whole, reasonably 
satisfactory when the split-half method was used and when retesting took 
place within a very short time. These scales in regard to validity, 
however, present the uncertainties so common to nonverbal tests. Validity 
data are given principally in terms of factorial analysis, indicating 
"saturation" with the general factor (g). But the scales' concurrent and 
predictive validities (see Chapter 5) have not been demonstrated. And 
since test findings, in and of themselves, are not the primary or major 
concern of psychologists who use tests, except in certain theoretical 
problems, this deficiency is a serious one. On the other hand, studies 
have shown that these scales are culture-fair, when population samples in 
the United States, France, Britain, and Australia were tested; for no 
significant group differences were found among these national groups, all 
of them essentially of Western culture. In other and dissimilar cultures, 
the obtained [...] were significantly below those found in the process of 
standardization. If these data are representative, we must conclude that 
not only is it highly improbable that culture-free tests can be developed, 
but that tests should be labeled "culture-fair" only when used in 
countries whose cultures are essentially similar. 

IPAT are the initials for Institute for Personality and Ability Testing, 
Champaign, Ill. 

Fig. 15.11. Practice items from the IPAT Culture Free Intelligence Test, 
[...] Form B (panels: Test 1. Series; Test 2. Classification; Test 3. 
Matrices; Test 4. Conditions). Copyright, The Institute for Personality 
and Ability Testing. By permission. 


Evaluation of Nonverbal Group Scales 


Uses. A survey of available nonverbal scales shows, for the most part, 
that they are valuable primarily with children who have had limited 
educational opportunities or impoverished social backgrounds, with young 
children who have not yet learned to read, with older [...] who are 
handicapped by reading or language difficulties, and with illiterate or 
non-English-speaking adults. Possible exceptions to this statement of 
limited usefulness are the Pattern Perception and the Progressive 
Matrices. 

Nonverbal tests are valuable, also, for the better diagnosis of persons 
who, on verbal tests, have intelligence quotients between about 60 and 
75, and who, therefore, would be considered as subjects for special 
educational treatment or possibly institutional care. The examining 
psychologist might be in doubt with regard to such borderline cases; but if the 
results of the nonverbal tests confirm those of the verbal, he has reason 
to allay his doubts. However, if the rating on the nonverbal tests is 
significantly higher, then the case will require further study to account 
for the discrepancy. 

Nonverbal tests can be clinically useful, also, with individuals whose 
intelligence quotients are higher than 75; that is, for individuals who, on 
verbal tests, appear to be significantly less capable than there is reason 
to believe they actually are, on the basis of other information about 
them. In this connection, they are particularly useful in population 
centers having large numbers of non-English-speaking people. 

In whatever situation nonverbal tests are used, the examiner must 
realize that defective vision or slow psychomotor responses can be a handi- 
cap. The first of these handicaps points up the importance of clear 
drawings—a condition that is not always satisfied. 

Tests of mental ability have had widespread use in schools, where they 
have been utilized for purposes of educational and vocational guidance, 


as well as in the diagnosis of learning difficulties. Nonverbal group tests 
have been found valuable in efforts to determine aptitude and promise 
in shopwork, mechanical drawing, architectural drafting, and occupations 
of a mechanical or quasi-mechanical nature, all of which make demands 
upon psychological operations that involve geometric perceptions and 
reasoning with the concrete rather than with the abstract. 

Validity. Studies of the validity of nonverbal scales show that 
although most of them correlate significantly with scales of the verbal 
type (individual and group), the coefficients are far enough removed from 
unity to warrant using the two types as supplements rather than as 
equivalents. When scores on verbal and on nonverbal scales are correlated, 
for children in the earlier grades (approximately through grade 6) the 
coefficients obtained are usually in the .60s and .70s, with relatively 
few in the .80s. But when the subjects tested are pupils in the later 
grades, the coefficients usually fall in the .50s and [...], with a [...] 
and a few higher. These generally lower coefficients, in the case of 
pupils in the later grades, result from the inability of most available 
nonverbal tests [...] individuals in the upper levels. 

[...] will recall that most, though not all, authors of nonverbal tests 
of mental ability seek to measure the same mental processes as those 
tested by means of verbal scales. Some of these authors are unequivocal in 
maintaining that the nonverbal tests require essentially the same type of 
mental operations as those required by the abstract symbols of language 
and number. They hold that the problems presented in diagrams, pictures, 
charts, and geometric forms closely parallel those presented by means of 
language and number. For example: picture arrangement is regarded as 
being similar in function to 
disarranged sentences; picture analogies similar to word analogies; pic- 
ture completion similar to sentence completion; reasoning with geo- 
metric patterns similar to reasoning with numbers and words; perceiving 
similarities, differences, and part-whole relationships in pictures and pat- 
terns similar to such relationships in language form. Many nonverbal 
tests, however, suffer from their attempt to assess general ability by means 
of a single type of item or by a limited number of subtests. 

The coefficients of correlation found between verbal and nonverbal 
tests of intelligence demonstrate that there is merit in the view that the 
two types are, in some degree, measuring the same or associated func- 
tions. But this does not mean that verbal and nonverbal tests are equiv- 
alent; for one type involves certain functions not involved in the other, 
or one may demand a higher level of the same functions being tested 
than does the other. 
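
One conventional way to see how far such coefficients fall short of equivalence is to square them: the squared correlation estimates the proportion of variance the two sets of scores have in common. This reading is a standard statistical convention added here for illustration, not an argument made in the text, and the function name is hypothetical.

```python
def shared_variance(r):
    """Proportion of variance two measures share, taken as r squared."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


if __name__ == "__main__":
    # Coefficients of the sizes discussed in this section.
    for r in (0.50, 0.60, 0.70, 0.80):
        print(f"r = {r:.2f} -> about {shared_variance(r):.0%} shared variance")
```

On this reading, coefficients in the .60s and .70s leave roughly half or more of the variance unshared, which is consistent with treating the two types of test as supplements rather than substitutes.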

Language and number are symbolic systems that represent something 
else: for example, objects, qualities, events, actions. Development of abil- 
ities in language and number facilitates intelligent behavior, since the use 
of these symbols expands the individual's range of experience beyond 
the limits of the immediate situation. Development of language and num- 
ber makes possible a finer discernment of forms and objects in the world 
surrounding the individual; for with the use of language and number he 
is able to analyze, synthesize, classify, and organize his perceptions. Ob-
jects and events, at first vague, are more sharply defined; likenesses and
differences are accentuated; evaluations are refined. Language and num-
ber also enable individuals to organize their thinking into larger and
more comprehensive, unified patterns.

Because the use of language and number requires the individual to go
beyond the immediate concrete situation, and because he thereby can
engage in more complex and subtle mental operations, many psycholo-
gists regard the ability to deal with symbols as a higher form of intel-
lectual activity than the ability to deal with concrete objects. They prefer,
therefore, to test intelligence, whenever possible and appropriate, by
means of verbal and numerical materials. However, they would use non-
verbal tests when these are made necessary by developmental immaturity,
or language or cultural handicap, to gain the insights that these tests
provide if they are adequately scaled in difficulty.

Cultural Influences. Emphasis upon verbal and quantitative
aspects of intelligence in many of the individual and group scales has
given rise to a misapprehension regarding the nonverbal scales: that
these latter are culture-free. Inspection of the items in the several scales
reveals that they utilize many objects that children and older persons 
learn about through experiences in their environments. These experi- 
ences are as dependent upon culture as is development of verbal and 
quantitative abilities. The differences are matters of degree of cultural 
influence and universality or near-universality of experience. Conse- 
quently, it is preferable to speak of "culture-fair" tests when referring to 
those whose materials do not handicap or favor any segment of the popu- 
lation for whom the device is intended. 

The presence of cultural influence in a test that appears to be culture- 
free was demonstrated in a study made with several tribes of North 
American Indians (18); the Goodenough Draw-A-Man Test was used. In 
a group of Hopi Indians, the mean IQ for boys was 123, while for girls
it was 102. Zias also showed appreciable differences in favor of boys,
whereas in a group of Navahos, the means of boys and of girls were very
nearly equal (107 and 110). Sex differences within a tribe are attributed
to sex differences in training and experience. Boys and girls are trained
to observe different aspects and details of their environment and are
taught different types of drawing. The two sexes have different func-
tions in their group; these functions are reflected in differentiated train-
ing; the differences in training are reflected in differences in performance.

Since every person must develop in an environment of some kind, his
skills, information, repertory of responses, modes of thinking, and so on,
are to some extent culturally determined. Some psychological tests are
more culture-fair than others. At this point we recall again Binet's prin-
ciple that a test of intelligence should be consonant with the milieu of
those who are to be measured by it.

16.


GROUP SCALES OF INTELLIGENCE: 
ELEMENTARY, SECONDARY, AND 
HIGHER LEVELS 


Some of the early group tests, designed for use from grade 4 
through grade 12—and their corresponding chronological ages—con- 
sisted largely of verbal and numerical items; for example, the Henmon- 
Nelson, National Intelligence Test, Otis scales. Others included, in addi- 
tion, an appreciable portion of nonverbal materials; for example, the 
Pintner General Ability Test and the Dearborn Group Tests. 

The more recent practice in a number of instances is to give the non- 
verbal subtests considerable weight, in addition to the verbal and nu- 
merical, as in the Lorge-Thorndike, the California Test of Mental Ma- 
turity, and the Kuhlmann-Anderson (though to a lesser extent in this 
last scale). The revised Henmon-Nelson (1957), by contrast, is still pre- 
dominantly verbal and numerical, whereas the SRA Tests of General 
Ability (described in Chapter 15) are entirely nonverbal at all levels. 

It is our purpose, in this chapter, to describe representative group 
scales constructed for use at the elementary, secondary, and higher educa- 
tional and age levels. Since there are a fairly large number of scales that 
come within these categories, it is neither possible nor necessary to de-
scribe all of them. A sufficient amount of material will be
presented, however, to acquaint the student with their quality, char-
acteristics and content, similarities and differences, and with the psycho-
logical processes they test.

Elementary and Secondary School Levels


Several of the scales discussed in the preceding chapter provide 
tests at levels above the primary grades. Only those parts of their content 
(subtests) that differ from the materials used at the primary level will be 
presented here. 

THE CALIFORNIA TEST OF MENTAL MATURITY. This instrument pro-
vides scales for use throughout the primary, elementary, and secondary
educational levels, and for adults. The types of mental operations to be
tested are regarded as the same throughout the entire age range. Whereas
at the primary level the test items are almost entirely nonverbal, the
successive levels employ increasingly difficult verbal and numerical ma-
terials, although pictorial items, also of increasing difficulty, are in-
cluded throughout. The numerical and verbal subtests at levels above
the primary are as follows:


Inferences. Each item contains two premises; the logical conclusion based
upon these premises must be selected from among several given choices.
For example:

Joe is shorter than Harold
Harold is shorter than Sam
Who is the shortest: Joe, Harold, or Sam?

Number series. (1) Each of one type of item consists of a number series
that increases or decreases according to a pattern; the testee is to select in
each sequence the one number that does not follow the principle (for ex-
ample, 1, 3, 5, 8, 7, 9); (2) the second type of number series requires the
testee to select from several alternatives the three numbers that have been
omitted from each series.

Numerical quantity. (1) The individual indicates, from several alterna-
tives, how many coins of each denomination are required to make up a
specified sum; (2) part two consists of a series of increasingly difficult arith-
metical problems.

Verbal concepts. This is the familiar test of word similarities. The task is
to select, from four choices, the one word that is synonymous with the key
word, or nearly so.
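
The first type of number-series item can be made concrete with a short sketch. It assumes, purely for illustration, that the governing pattern is a constant difference between neighboring numbers, which fits the example 1, 3, 5, 8, 7, 9; actual test items may follow other rules. The function name `odd_one_out` is hypothetical.

```python
def odd_one_out(series):
    """Return the one number whose removal leaves a constant-difference series.

    Assumes exactly one intruding number; returns None if no single
    removal produces a constant difference between neighbors.
    """
    for i in range(len(series)):
        rest = series[:i] + series[i + 1:]
        # Collect the distinct differences between neighbors in what remains.
        diffs = {b - a for a, b in zip(rest, rest[1:])}
        if len(diffs) == 1:
            return series[i]
    return None


# The example quoted in the text: 8 breaks the "add 2" pattern.
print(odd_one_out([1, 3, 5, 8, 7, 9]))
```

A rule-based check of this kind shows only what the item demands of the testee: the principle must first be inferred from the series before the intruding number can be identified.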

The subtests of these scales are so arranged that it is possible to obtain
separate mental ages and IQs for the verbal, nonverbal, and total scores.
It is possible, also, to provide a profile of the twelve subtest scores; but since
individual subtest scores are neither as reliable nor as valid as either the
full verbal or the full nonverbal scores, or as valid as the total-test scores,
the profile has limited value, except in instances of marked discrepancies
between subtest scores.

HENMON-NELSON TESTS OF MENTAL ABILITY (1957).¹ This instrument
provides tests at four levels: grades 3-6, 6-9, 9-12, and college.² The items
are arranged in "spiral omnibus" form; that is, the several different types 
of items are distributed throughout the scale instead of being grouped 
by subtest type (for example, number series, word meaning). Successive 
items of each type are of increasing difficulty. When one uses such a test,
it is obvious that only one over-all score can be derived. This is what
the authors intend; for their instrument is based essentially on the gen-
eral-factor theory of mental ability. They state that these scales are in-
tended to measure ". . . those aspects of mental ability which are im-
portant for success in academic work and in similar endeavors outside
the classroom."

FIG. 16.1. Sample items from the Henmon-Nelson Test. Reproduced by per-
mission.


The test items included are of the familiar types which, over many
years, have become established as valuable for the stated purpose, al-
though several of these are not commonly used in other tests at present
(for example, scrambled words, scrambled sentences). The types are the
following: figure analogies, word analogies, number series, reasoning
(“My sister's daughter is my father’s ____”), word classification,
scrambled words (hncul: lunch), scrambled sentences, information, vocab-
ulary, proverb interpretation, arithmetical problems.

¹ Revised by T. A. Lamke and M. J. Nelson. Names of publishers of tests are given in
References at the end of the chapter.
² The scale for college level was revised in 1961, with P. C. Kelso as coauthor.

Although a well-standardized scale of the omnibus type, as the Henmon-
Nelson is, can have considerable general value in measuring mental
ability and estimating educability, its drawback is the fact that analyses
and comparisons of functioning with different types of materials, for
diagnostic purposes, are very difficult to make. One might want to make
such analyses, although the scale is intended to measure the g factor.

In support of their position, the authors of these scales report research
findings which indicate that “factored” (multiscored) tests do not predict
success in schoolwork more effectively than a single over-all score. Hence,
if a single index and predictor of school success is wanted, a soundly
standardized scale of the Henmon-Nelson type will serve very well. On
the other hand, if several areas of ability need to be differentiated to
reveal strengths and weaknesses, for purposes of diagnosis and remedial
instruction, a sound scale arranged in the form of subtests, or a battery
of tests, each restricted in scope, would be preferable. ("Factored" and
multiscored scales are discussed in the next chapter.)

KUHLMANN-ANDERSON TESTS (1952). This series of scales, of which
there are nine, graded according to school level, includes thirty-nine
subtests. Each subtest as a whole is placed according to its over-all rela-
tive difficulty in the age range; and the items within each subtest are
placed in order of their difficulty. Since intelligence levels vary consider-
ably within any single age group, and since there is overlapping among
different age groups, there is also duplication of subtests from one scale
to the next. Thus, the scales for the adjacent levels include the subtests
as indicated below:

kindergarten, subtests 1-10
grade 1, subtests 4-13
grade 2, subtests 8-17
grade 3, subtests 12-21
grade 4, subtests 15-24
grade 5, subtests 19-28
grade 6, subtests 22-31
grades 7-8, subtests 25-34
grades 9-12, subtests 30-39
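
The extent of duplication between adjacent scales can be read directly from the list above. The following sketch simply counts the subtests shared by each pair of neighboring levels; the subtest ranges are copied from the list, while the names `SCALES` and `shared_subtests` are illustrative only.

```python
# (level, first subtest, last subtest), inclusive, as listed in the text.
SCALES = [
    ("kindergarten", 1, 10),
    ("grade 1", 4, 13),
    ("grade 2", 8, 17),
    ("grade 3", 12, 21),
    ("grade 4", 15, 24),
    ("grade 5", 19, 28),
    ("grade 6", 22, 31),
    ("grades 7-8", 25, 34),
    ("grades 9-12", 30, 39),
]


def shared_subtests(a, b):
    """Return the subtest numbers that two scales have in common."""
    _, a_first, a_last = a
    _, b_first, b_last = b
    return list(range(max(a_first, b_first), min(a_last, b_last) + 1))


# Number of subtests shared by each pair of adjacent levels.
overlaps = [len(shared_subtests(a, b)) for a, b in zip(SCALES, SCALES[1:])]

for (lower, _, _), (upper, _, _), n in zip(SCALES, SCALES[1:], overlaps):
    print(f"{lower} / {upper}: {n} shared subtests")
```

Each scale contains ten subtests, so by this count at least half of every scale is repeated in the next one.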


The subtest types are not unusual, since they utilize the familiar non-
verbal and verbal materials, some of which have already been described
in the preceding chapter. The verbal and numerical subtests, beginning
with grade 4, are as follows: scrambled words, substitution of letters for
numerals (word building), word classification, word meanings (and in-
formation), word opposites and similarities, word analysis, accuracy of

TERMAN-McNEMAR TEST OF MENTAL ABILITY (1949). This instrument
is intended for use primarily in grades 7 through 12, though norms are 
provided from the age of 10 years through 19 years, 11 months. Its con- 
tent is homogeneous in that it is entirely verbal in character; for, con- 
sistent with Terman’s definition of intelligence, this scale is devised to 
measure ability to deal with items utilizing symbols and abstractions. 
The scale consists of seven subtests: information, synonyms, logical se- 
lection, classification, analogies, opposites, and best answer. 

The authors subscribe to the general-factor (g) theory of intelligence. 
They hold that the general factor is best tested by means of materials 
using symbols and abstractions. In order to achieve a high degree of 
homogeneity in test materials, they omitted from their scale arithmetical 
and numerical types of subtests, which also are widely regarded as good 
tests of the general factor. The authors state the reason for their selection 
of materials as follows: “More homogeneous material has been used in 
order to have a test more highly saturated with a common factor or 
ability. Thus, the exclusion of arithmetical and numerical subtests means 
that the scores of any two individuals are more nearly comparable quali- 
tatively; i.e., they lie along the same continuum. This continuum may 
be characterized as general verbal intelligence." The usefulness of this 
scale, therefore, is restricted to subjects who are not laboring under a
language handicap and to situations wherein “verbal intelligence” is
required as the major ability.


Items from Terman-McNemar Test of Mental Ability
(Harcourt, Brace & World, Inc. By permission)

Information:
    Polo is a kind of
    (1) disease (2) work (3) bear (4) game (5) language
Synonyms:
    Comic: (1) clumsy (2) laughable (3) universal (4) tricky (5) peculiar
Logical selection:
    An orchestra always has
    (1) violinists (2) piano (3) musicians (4) saxophone (5) singers
Classification:
    (1) Catholic (2) Methodist (3) Presbyterian (4) Republican (5) Baptist
Analogies:
    Zoo is to animal as aquarium is to:
    (1) birds (2) fish (3) bees (4) statues (5) butterflies
Opposites:
    Exit: (1) emit (2) transcend (3) entrance (4) origin (5) arrival
Best answer:
    The saying, “Idle brains are the devil's workhouse,” means
    (1) The devil is lazy.
    (2) People who are idle get into trouble.
    (3) Many hands make light work.
    (4) The devil works with his brains.

* This scale is a revision of the Terman Group Test of Mental Ability, published in
1920. This first version included also numerical and arithmetical tests.

Reliability and Validity. The group tests described thus far
in this chapter do not present any unfamiliar problems or findings. Their
indexes of reliability are high, as shown in Table 16.1. Also, the “errors
of measurement” presented in the manuals are all within a satisfactory
range of reliability.
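
Reliability coefficients of the odd-even (split-half) kind can be illustrated with a short sketch. It assumes the usual procedure: score the odd-numbered and even-numbered items separately, correlate the two half-scores, and step the result up to full test length with the Spearman-Brown formula, r_full = 2r / (1 + r). The response data and function names are invented for illustration and are not taken from any of the test manuals.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def odd_even_reliability(item_scores):
    """item_scores: one row per testee, one 0/1 entry per item."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))


# Hypothetical responses of five testees to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(round(odd_even_reliability(responses), 2))
```

A coefficient obtained in this way presupposes that the odd and even items are comparable halves of the test; where item types alternate systematically, the two halves are not comparable and the coefficient loses its meaning.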


TABLE 16.1

RELIABILITY COEFFICIENTS OF FIVE GROUP SCALES

Scale                 Range of coefficients   Method
California            .87-.95                 split-half
Henmon-Nelson         .90-.95                 odd-even
                      .91-.93                 alternate forms
Kuhlmann-Anderson     .88-.95                 odd-even
                      .90                     test-retest
Lorge-Thorndike       .76-.90                 alternate forms
                      .88-.94                 odd-even *
Terman-McNemar        .96                     split-half
                      .95                     alternate forms

* At level 2, the nonverbal reliability coefficient was .59. Of this, the authors
state: “At this level, an odd-even reliability coefficient is really not meaningful,
since there is a systematic alternation between geometric and pictorial items
in subtests 2 and 3.”

As is always the case, validation of a scale is a more complex and
difficult task than is establishing its reliability. Each of the group scales
described used several of the well-established criteria of validity, as
indicated below.


California:

Correlations with the S-B, the Wechsler, and several other group scales:
coefficients as usual varied from moderate to high, depending upon the
number of individuals and range of ability they represented.

Intercorrelations of subtests: these varied from .25 to .60, of which 70
percent were below .50 and 50 percent below .40.

Correlations with tests of school achievement: these were low at the pri-
mary levels; the language subtests at later levels correlated from .42 to .81
with the various school subjects.

IQ distribution: a mean of 100 and a standard deviation of 16 were
established.


Henmon-Nelson: 


Item analysis: items were analyzed in terms of known groups (inferior,
average, superior) as indicated by the composite scores on three well-
known group tests of mental ability.

Correlations with another group test of mental ability (the California): 
rs varied from .74 to .86. 


Correlations with educational achievement tests: rs varied from .64 to .85,
most being in the .70s.

Correlations with composites of school grades: rs varied from .68 to .72.


Kuhlmann-Anderson: 


Subtest intercorrelations: two thirds of these are in the .40s and .50s. 

Subtest intercorrelations with total scores: two thirds are between .50
and .81.

Correlations with school achievement: coefficients range from .60 to .80.

Differentiation among average, retarded, and accelerated pupils.

Relative uniformity of means, standard deviations, and ranges of IQs
at the several grade levels.


Lorge-Thorndike: 


Construct and content validity: to test abstract ability, defined as ability 
to work with ideas and relationships among ideas. This ability is divided 
into six aspects to be tested. 

Biserial correlation of each item with scores on the subtest of which it
is a part. (See Chapter 5 for discussion of biserial correlation.) The median
coefficients ranged from .43 to .70.

Correlations with performance in school subjects: one study yielded a
coefficient of .87 with reading, and one with arithmetic yielded .76.

Correlations with four other group tests: forty-six of fifty-two coefficients
were .60 or higher.

Correlations with S-B and WISC: these were from .54 to .77. 

Intercorrelations of subtests: these ranged from .30 to .70; eight of the 
coefficients were .50 or higher; five were in the .40s. 

Correlation of verbal with nonverbal scores separately for each grade: 
coefficients varied from .54 to .70, all but two being in the .60s. 


Terman-McNemar: 


Item analysis: percentage of pupils passing each item in successive grades 
was the principal criterion. 

Item analysis: the correlation was found for each item with total score.⁵
No item yielding a coefficient of less than .30 was retained. Ninety percent
of the items yielded coefficients of .40 or higher, the mean being .53.

Correlations with school achievement tests: median coefficients were .62
with reading, .66 with language, .54 with mathematics, .66 with social
studies, and .64 with science.


Evaluation. The five group-testing instruments described above
are representative of the sounder scales of their type. From the lists of
their subtests, one can readily perceive which types of materials and test
problems have become established, through use and research, as most
valuable for educational and psychological purposes. The reliabilities of
these and similar scales are high, so that they may be used with a con-
siderable degree of confidence, as far as this aspect of testing is con-
cerned. The numbers of individuals in the standardization groups were
large, numbering many thousands; and they were geographically and
otherwise satisfactorily stratified. The statistical techniques used in their
construction, in determining reliability and validity, give assurance of
thorough analysis of data. Technically, they are on a high level of test
construction.

From the itemized lists of validating criteria, one can readily see the
extent of common practice in validating group scales, which criteria are
most frequently used, and which are emphasized by each of the several
tests’ authors. The data on external criteria of validity (correlations with
schoolwork, with grade placement, and with other scales, both individual
and group) are not as full or as extensive for some of these scales as for
others. Each device, therefore, must be evaluated with respect to the
specific use to which it is to be put. On the whole, it is desirable that
the reliability and validity of each group scale should be determined for
every grade level, or age level, for which it is intended. This ideal, ex-


⁵ For this purpose the tetrachoric-correlation technique was used. See Chapter 5.
For a fuller discussion, the student should consult a textbook in statistics.

range of knowledge which the total Information score peii Lia
1957, p. 4.) The authors might have added, also, that scope i ias
of information are significantly correlated with other measures o
r of intelligence.9

SCHOOL AND COLLEGE ABILITY TESTS (1955-1957). In
keeping with the justifiable current practice of designating
tests by their initials, these are known as the SCAT. Although there are
five levels, covering the range from the fourth grade through the college
sophomore year, and all are identical in conception and mental operations
tested, we shall be concerned only with the college level. The items
are of the familiar type, devised to obtain the now familiar verbal and
quantitative, as well as total, scores. The purpose of the SCAT is to
measure "school-learned abilities" and thereby to estimate individuals'
capacities to undertake additional schooling.

Although these scales contain test items said to be specifically related
to areas of study in the high school, and although use of the term
"intelligence" is avoided, their content is not markedly different from
that of other tests. When we test the intellectual potentialities of a se-
lected population for continued education in liberal arts colleges and in
higher technical schools, we are assessing their intelligence at the level
of abstract ability, whether we do so by means of test items that might
be directly related to high school studies in some degree, or whether we
attempt to measure mental operations through test items that are rela-
tively independent of formal instruction. For differences in what has been
learned and retained and can be utilized in organized thinking are in-
dications of differences in intelligence, assuming in this instance that the
testees have had reasonably comparable secondary school opportunities.
As a matter of fact, close examination of the test items in the SCAT
shows that they closely resemble other tests in both form and content.
Two representative items from the verbal tests follow.

Sentence Meaning (Select the most appropriate word.)
    Since the two questions were completely ____ it was necessary to
    consider them separately.
    f. irrelevant g. confused h. unrelated j. irrational k. theoretical

Word Meaning (Select the word or phrase closest in meaning.)
    Recur
    a. hold in bounds b. alternate c. revolve d. happen again e. save

* Predictive validity is discussed in a later section of this chapter.

In addition to the several scales described above, there are others that
have been constructed through equally careful planning and adherence
to principles of sound test development. Such principles include, among
others, extensive research and trial prior to making the tests available
for use. Among these scales are the College Qualifications Test (1955- 
1958) and the Henmon-Nelson (1961) for college level, both of which 
are similar in content to the others. 

Reliability and Validity. Techniques of construction are now
so well developed that it is possible to prepare tests of mental ability
having a high degree of reliability. The several scales used for the selec-
tion of college freshmen discussed herein, and a number of others as
well, all show high coefficients of reliability. For total scores, these range
from the high .80s to the mid .90s; while for separate subtest scores, the
coefficients at times drop to the low .70s, although most are appreciably
higher, rising to the low .90s.

In the SCAT, an additional and unusual device for indicating re-
liabilities of scores has been included. This is called the “percentile
band,” intended to replace the familiar percentile ranks in reporting
relative levels of scores. This device is used as a method of showing the
probable errors of measurement of obtained scores for each of the sep-
arate grade levels. Thus, consider a twelfth-grade boy who has the fol-
lowing scores (see Fig. 16.3): verbal, 297; quantitative, 304; total, 300.
On the “band,” the first falls between the 75th and the 89th percentiles
of his group; the second, between the 65th and 82nd percentiles; the
third, between the 79th and 86th percentiles. In each instance, these data
are interpreted to mean that the probabilities are approximately 2 to 1
(66 in 100) that this student’s true score in each of the three measures
falls within the indicated percentile limits. Although the bands overlap
somewhat, he appears to be somewhat higher in verbal than in quantita-
tive mental activity. He is, however, well above the average of his group
in both; and the band of his total score indicates the probabilities are
very high that his combined score ranks him within the highest quarter
of his group.⁷
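
The idea behind a band of this kind can be sketched numerically. The sketch assumes the conventional standard error of measurement, SEM = SD × sqrt(1 − r), and a band extending one SEM on either side of the obtained score, which under normally distributed errors gives odds of roughly 2 to 1 that the true score lies inside it. The standard deviation and reliability used here are hypothetical, not SCAT figures, and the conversion of score limits to percentile ranks, which requires the publisher's norms tables, is omitted. The overlap check uses the percentile limits quoted above for the twelfth-grade boy.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), with r the reliability coefficient."""
    return sd * sqrt(1.0 - reliability)


def score_band(score, sd, reliability):
    """Return (lower, upper) limits one SEM below and above the score."""
    sem = standard_error_of_measurement(sd, reliability)
    return (score - sem, score + sem)


def bands_overlap(a, b):
    """True if two (lower, upper) bands share any part of their range."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scale: SD of 10 score points, reliability of .91.
lower, upper = score_band(300, sd=10, reliability=0.91)
print(round(lower, 1), round(upper, 1))

# Percentile bands reported in the text for the verbal and quantitative scores.
verbal_band = (75, 89)
quantitative_band = (65, 82)
print(bands_overlap(verbal_band, quantitative_band))
```

Because the two bands overlap, the difference between the verbal and quantitative scores should be read cautiously; a band built on a reliability coefficient that is too high will be too narrow and will exaggerate such differences.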

From what has already been said and quoted with regard to each of
the foregoing scales, we may infer that the authors, at the outset, were
guided by the principles of content and construct validity in developing
their tests (see Chapter 5). In addition, however, these scales must have
predictive validity, since their purpose is to identify those stu-
dents who give promise of ability to learn at levels beyond the secondary
school. That is, the test scores must correlate well with grades earned
at the college level. Table 16.2 shows these correlations for a group of
representative scales.

⁷ The soundness of using percentile bands has been questioned by some critics, prin-
cipally because the bands are based upon standard errors of measurement that may be
spuriously small, since these measures were found with reliability coefficients of speeded
tests computed by a Kuder-Richardson formula. This formula is not recommended for
speeded tests. See Chapter 4 on methods of estimating reliability of speeded tests.

FIG. 16.3. A percentile band, from the SCAT Student Profile (School and College
Ability Tests), with columns for verbal and quantitative scores. Reproduced by
permission.

The wide range of validity correlations for a given scale cannot be
fully explained without knowing the variations in scholastic ability repre-
sented in each study, the admissions criteria and the scholastic standards
of each institution in which the study was made, and the curricula in
which the students were enrolled. In view of the differences in correlation
coefficients of prediction, it is essential that a scale's predictive validity
be separately determined for each of the several types of institutions.
Ideally, it would be desirable for each institution to cross-validate a scale
for its own purposes. Associated with this is the desirability of providing
separate norms for different types of institutions. This has been done in
some instances.
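
One reason validity coefficients differ from one institution to another is restriction of range: a selective college admits a narrower band of ability, and correlations shrink accordingly. The sketch below uses the standard formula for direct restriction on the predictor, r' = r·u / sqrt(1 − r² + r²u²), where u is the ratio of the restricted to the unrestricted standard deviation of test scores. The numerical values are hypothetical and are not drawn from any of the studies cited in this chapter.

```python
from math import sqrt


def restricted_r(r, u):
    """Correlation expected after direct range restriction on the predictor.

    r: correlation in the unrestricted group
    u: restricted SD divided by unrestricted SD of the test scores
    """
    if u <= 0:
        raise ValueError("u must be positive")
    return r * u / sqrt(1 - r * r + r * r * u * u)


# A test correlating .60 with grades among all applicants, examined in
# student bodies of decreasing variability in test scores.
for u in (1.0, 0.8, 0.6, 0.4):
    print(f"SD ratio {u:.1f}: r = {restricted_r(0.60, u):.2f}")
```

On these assumptions the same test would yield markedly lower coefficients in a homogeneous student body than in a heterogeneous one, which is why each institution is advised to establish validity for its own population.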


TABLE 16.2

PREDICTIVE VALIDITY CORRELATIONS OF SCALES
FOR COLLEGE FRESHMEN

Scales                             Coefficients   Criteria
American Council (ACE)             .25-.60        Total point average
College Placement Test (CPT)       .57            Freshman grades
College Qualification Test (CQT)   .26-.67        Junior college; first term grades
                                   .35-.68        Publicly controlled colleges; first semester grades
                                   .34-.71        Privately controlled colleges; first semester grades
Henmon-Nelson                      .54            First semester grades
Ohio State University              .58-.65        Freshman grades
Scholastic Aptitude Tests (SAT)    .45*           Verbal scores with freshman average
                                   .58*           Math. scores with freshman average
SCAT                               .43-.68        Freshman grades

* These are average correlation coefficients for a large number of studies.


In Table 16.2 most of the correlation coefficients are based on over-all
scholastic averages. The student should be aware of the fact, however,
that the coefficients found between the tests and individual subjects of
study may, and often do, differ significantly. It is desirable to be familiar
with the efficiency of total scores and part scores (for example, verbal
vs. quantitative) in predicting performance in the several areas of study
as well as in general. For example, in one investigation, using the SAT,
verbal scores correlated with grades in physics from .28 to .36, while the
mathematical (quantitative) scores correlated from .48 to .59 with the
same grades. Verbal scores and scholastic averages for the freshman year
correlated .45, while the scores of the quantitative tests correlated .58. It
is to be expected, and it does so happen, that the linguistic tests will
usually yield higher correlations with grades in social studies and the
humanities, whereas the quantitative tests are more highly correlated
with grades in mathematics and the sciences.
It should be noted, also, that the correlational studies deal, for
the most part, with the college freshman year and, to some extent,
the sophomore. In many institutions, the coefficients will be
higher for these first two years than for the last two. In one college of
liberal arts in a large university, the coefficients dropped gradually from
.48 in the first semester of the freshman year to .18 in the second semester
of the senior year. The students, as a group, become less heterogeneous
in successive years; nonintellectual factors (social, emotional, etc.)
can become increasingly important; the several areas of study and spe-
cialization are not of equal difficulty; grading may be less rigorous
at the higher course levels; as students become immersed in a subject of
study, smaller differences in ability are not significant, hence do not
affect grades.

Coefficients of correlation are significant, of course, in estimating the
predictive validity of tests. It appears, however, that too much emphasis
has been put upon them, often to the exclusion of another statistical
technique that can provide even more significant data; that is, expectancy
tables, of which there are several types (see Chapter 5). These two tech-
niques, each used to supplement the other, will provide much more in-
sight into the validity of a test than either used alone. It would be highly
desirable to have expectancy


Evaluation. As a group, current tests for the selection of college


much so, in fact, that 
another, in general con 

A major criticism ag 
studies of predictive 
institutions, not ade 
technical schools. Therefore, 
sible value for a Particular i 
istics of the instituti 
ardized be examine. 
tion. 

Test scores at the college freshman level do not differentiate among
individuals within the middle range of the distribution (approximately
the middle 60 percent) for the purpose of predicting scholastic
achievement. There are several reasons for this: the middle group is less
heterogeneous than those at other levels of the distribution; the tests are
not sufficiently refined to discern the small differences within this group;
the numerous nonintellectual and chance factors that affect students' course
marks can be more significant in their influence upon correlations
between test scores and scholastic achievement than is the case at the upper
and lower levels. For example, one study showed that the scale had
considerable value in predicting quality of course work and duration of
attendance for the highest 20 percent and the lowest 20 percent.
Predictions for the middle 60 percent were unreliable. Even this limited
usefulness warrants the use of the sounder tests, when the results are
interpreted by qualified persons, since in this way the most promising and
the least promising prospective students may be recognized. The soundest
practice at present, however, is to use the test score as one of two or three
primary criteria, the others being the secondary school record and scores
on entrance examinations (educational achievement tests). On the whole,
college tests of mental ability show up very well when one considers the
uncontrollable variables affecting course grades, with which test scores are
correlated. Some of these are subjectively determined high school and
college grades, fluctuations and variations in motivation, factors of health,
economic pressures, personality traits, and emotional problems of late
adolescence.
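
The weaker prediction within the middle 60 percent is what one would expect whenever a correlation is computed within a band of reduced variability. The following is a minimal sketch in Python, not a reanalysis of any study cited here. It assumes simulated, normally distributed test scores and grades with a true correlation of .50; the sample size, the seed, and all function names are hypothetical choices made only for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=0.5, seed=1):
    """Simulated (test score, grade) pairs with population correlation rho.

    Assumption: both variables are standard normal; real grades are not.
    """
    rng = random.Random(seed)
    scores, grades = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        scores.append(x)
        grades.append(y)
    return scores, grades


def middle_band(scores, grades, lower=0.20, upper=0.80):
    """Keep only cases whose test score lies between two percentile ranks."""
    ranked = sorted(scores)
    low_cut = ranked[int(lower * len(ranked))]
    high_cut = ranked[int(upper * len(ranked)) - 1]
    kept = [(x, y) for x, y in zip(scores, grades) if low_cut <= x <= high_cut]
    return [x for x, _ in kept], [y for _, y in kept]


scores, grades = simulate()
r_all = pearson_r(scores, grades)
mid_scores, mid_grades = middle_band(scores, grades)
r_mid = pearson_r(mid_scores, mid_grades)

print(f"all students:      r = {r_all:.2f}")
print(f"middle 60 percent: r = {r_mid:.2f}")
```

Under these assumptions the coefficient for the middle band comes out well below the coefficient for the whole group, although the relationship between score and grade is the same for every simulated student. The drop reflects the narrowed range of scores, not a defect peculiar to the test.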

Tests for Graduate Students and Superior Adults. The use of
tests of mental ability has been extended, rather widely, to graduate and
professional schools. One of these, the Graduate Record Examination
(GRE), will be described in a later chapter dealing with educational
achievement tests; for it includes sections designed to measure learning
and mastery of subject matter in the several areas of undergraduate study
as well as subtests which, though on a higher level, are much the same in
form and purpose as those used in measures of mental ability for college
work. The most widely used measure of general mental ability for
prospective graduate students is the Miller Analogies Test. Another
instrument, devised specifically for use with superior adults, is Terman's
Concept Mastery Test. Both of these will be described to show the direction
being taken in testing at the higher levels of mental ability.
MILLER ANALOGIES TEST (1926-1960). Originally devised in 1926, this
scale has been developed to measure scholastic aptitude at the graduate
school level. It consists of a large number of items, with a time limit of
fifty minutes; but the speed factor is said to be of negligible importance.
The test includes analogies covering a wide variety of fields of learning
and specialization. Although quantitative as well as verbal materials are
used, the items are predominantly verbal in character. The logical
relationships to be discerned within each of the items are not uniform; thus
the subject must reorient his analysis with each item. (It is not possible
to illustrate or describe the items, since the test is not circulated, and its
use is strictly controlled.) The material in the test is often abstruse; and
many of the analogies are complex. Consequently, this test has a very
high upper level of difficulty, regarded by some as probably the most
difficult among tests of abstract ability.

Since the intellectual levels of graduate students are not uniform from 
university to university, nor among the several departments within a 
given university, percentile norms are provided for each of the several 
fields of study, for each of a group of universities, and for a few
professional schools. The differences among some of these groups are
appreciable; as, for example, when the median (50th percentile) score in
one subject-matter group is equal to the 80th or 90th percentile score of
another. 

So far as reliability is concerned, this is a well-constructed test.
Odd-even reliability coefficients are in the low .90s; alternate forms
correlate between .85 and .89.

Validity has been estimated largely, but not solely, by the instrument's 
ability to predict grades in graduate courses and on comprehensive
graduate examinations. Findings on predictive validity at the university
graduate level are affected by the same determinants as those at the college 
level, that is, subjectivity of marking standards, differences in intellectual 
demands made by different courses and areas, differences in standards 
among institutions, and individual motivation. Correlations at the gradu- 
ate level are probably depressed, also, by the greater homogeneity of the 
students in regard to level of ability, as compared with the generality of 
undergraduates. Correlation coefficients of validity, with course grades 
and comprehensive examinations, have varied from about .90 to the high 
.70s, most being in the .40s and .50s. When correlated with the subject
matter sections of the GRE, the coefficients were, for the most part, be- 
tween .75 and .80; while with the “verbal factor” of the GRE (that is, the 
verbal part of the test of general mental ability), the coefficients were in 
the low .80s.

In view of these high coefficients between the Miller test and the GRE,
one might ask why both are often required by some university
departments or why the latter should be given at all, since it takes several
hours while the Miller requires less than an hour. This question has been
answered, in some instances, by requiring only the GRE, because it
combines educational achievement tests with tests of general mental ability.
Those who use both do so, presumably, because of the subject-matter
content of the GRE and the high degree of difficulty at the upper level
of the Miller test; and also because one instrument can supplement the
other.

Ability to learn verbal and quantitative concepts and course materials 
is an important factor in successful graduate study. The Miller Analogies 
Test can make an important contribution to assessment of this ability 
among individuals in a selected group.8 More extensive information
would be desirable, however, regarding its predictive value in each of the 
several areas of graduate study: social studies, physical sciences, biological
sciences, and the humanities; and perhaps even in each of the departmental
divisions within each area. Cross-validation by each university, for
its own guidance, is also desirable. 

What the analogies tests do not evaluate—nor do they purport to do so 
—is creative research ability and originality in constructive thinking, both 
of which should rank high in graduate study. 

CONCEPT MASTERY TEST (1956). Developed under the direction of
L. M. Terman, this test was devised as a measure of ability to deal with 
abstract ideas at a high level. Its purpose is to measure intellectual
functions similar to those measured by the Stanford-Binet; thus, it would be
"highly saturated" with the general factor (g). The original form was
constructed to measure
mental ability of superior and gifted individuals, in early maturity, in 
the long-term longitudinal studies conducted under Terman's direction. 
A new instrument was necessary for this group because the Stanford-Binet 
and other available scales did not advance to a high enough level of dif- 
ficulty to test and differentiate among them. A later edition (Form T) was 
developed for use with the gifted group and was released for use with 
college students at the junior-year level and higher, and for other adults
at the upper levels of mental ability. 

The content is of two familiar types: identification of synonyms and
antonyms, and completion of analogies. The analogies employ both verbal
and numerical concepts; and they draw on a wide variety of
subject-matter fields.

The CMT, like other carefully constructed tests, has a high degree of
reliability. Alternate forms, used with four groups, gave coefficients from
.86 to .94. Two of these groups (undergraduates and graduate students at
Stanford) were retested at intervals of one day to one week. Two others
(gifted subjects and their spouses) were retested after an interval of eleven
to twelve years. The high reliability coefficients for these two groups
([...] and .92), after such a long period, attest not only to the test's
reliability but also to the relative stability of mental functioning by
individuals at the highest levels of mental ability.


8 A similar test at a high level of difficulty, called the Advanced Personnel Test, has
been prepared by Miller. It is a measure of verbal reasoning ability for use by business,
industry, and government "for employment and upgrading of management and research
personnel." The Doppelt Mathematical Reasoning Test has been prepared "in response
to a need expressed by the users of the Miller Analogies Test and the Advanced Personnel
Test for a comparable high-level measure of numerical reasoning ability." The Miller
and the Doppelt tests, both restricted, are distributed by The Psychological Corporation.

Validating a test of intelligence for use at the superior adult level
presents more than the usual difficulties, because satisfactory
criteria are not readily available, nor are they as reliable as those used
in validating tests for children and adolescents. Terman and his colleagues
employed a number of criteria that provide favorable evidence.
Correlations of parts I and II (synonyms-antonyms with analogies), for two
groups, were high: .75 and .76. For a small group (N = 59) the CMT and
the SAT correlated .70. The mean and the standard deviation of the CMT
classified individuals in a manner consistent with Stanford-Binet IQs
obtained in childhood. Mean scores of the CMT progressed in conformity
with the IQ distribution obtained in childhood. For a group of gifted
persons and their spouses, the mean score on the CMT is significantly
associated with levels of higher education achieved. The same was true of a
group of U.S. Air Force captains. When correlated with grade-point
averages of a group of undergraduates (N = 97), a coefficient of .49 was
found.9 Again, and as stated in connection with other tests, an instrument
such as this, intended for a highly selected segment of the population,
should be validated for each institution and situation in which it is to be
used, unless reasonable conformity with the standardization group is
assured.


Representative Scales for Adults 


It is not a common practice to test adults in general, as is the
case with children, adolescents, and college students who are tested for
purposes of educational selection and vocational guidance. When adults
are tested, it is for specific reasons, such as preliminary screening of
business and industrial personnel; screening in the armed forces; and
upgrading of individuals.10 When tests of general ability are given for these
purposes, they are usually supplements to tests of specific aptitudes (see
Chapters 17, 18, and 19). Both kinds are used as predictors of ability to
learn specific types of materials and as predictors of performance in a
specified type of occupation.


9 The grade-point averages were correlated also with CMT scores for a group of 124
undergraduates who were seen at the university's counseling center. The coefficient was
only .57. This index, however, should not be regarded as representative of the test's
differentiating ability, since students who seek the help of a counseling center are a [...]
group, having more than the usual personality problems and difficulties of adjustment
to college work.

10 Individual tests of mental ability are used with adults, of course, in hospitals,
clinics, and in the private practice of psychology.

Two of the scales already described in other connections, the California 
and the Kuhlmann-Anderson, provide tests at the adult level. It would be 
possible, also, to use a scale (for example, the Lorge-Thorndike or the 
Henmon-Nelson) that provides norms through grade 12 or higher; but 
if that were done, adults would be rated only with reference to grade and 
age norms within the scope of each scale. 

Some instruments, however, have been constructed specifically for adult
use. For example, among the older scales are the Pintner Advanced Test,
which begins with the ninth grade and continues through adult levels;
the Otis Self-Administering Test of Mental Ability: Higher Examination;
and the Army General Classification Test (AGCT), based upon the
Army Alpha, which was used in World War I. A more recent adaptation
is the Modified Alpha Examination, Form 9. This last-named instrument
follows the currently common practice of providing a total score and
separate scores for the numerical and for the verbal test items. All of
these scales include types of items with which the student is already
familiar. Perhaps the only type to which attention should be called is
block-counting (spatial perception) in the AGCT, now infrequently used
in adult tests of general mental ability.

Since the end of World War II, there has been a tendency to develop
and use brief or abbreviated scales in adult personnel selection. There
are, for example, the Thurstone Test of Mental Alertness (1943-1953)
(twenty minutes), the Wesman Personnel Classification Test (1946-1951)
(twenty-eight minutes), and the Wonderlic Personnel Test (1939-1945)
(twelve minutes). Occasionally, a special brief screening test is devised
for a specific business or industry. The form of the content and the
mental operations involved are the same as those used in scales intended
for use with a more general segment of the population; but the specific
items are related to the type of business in which the test is to be used.
For example, the American Stores Company uses its own Personnel Selection
Test (constructed in collaboration with staff members of The
Psychological Corporation). This test, naturally, is heavily loaded with items
related to the grocery business.

The adult scales available for general use, by authorized professional
persons, are not interchangeable. One (for example, the Wonderlic) may
have its highest validity in the selection of personnel for certain types of
clerical work; another may be more valid in selecting individuals for
higher level rather than lower level employment (for example, the
Wesman). It has been found that validity data vary considerably with the
levels and types of work, the characteristics of the sample of persons
tested, and the criteria used. In these respects, adult tests used for
screening and selection are comparable with those used in educational
institutions at all levels. The problem of obtaining a representative sample of
the general adult population, however, is a more difficult one than getting
one of children, adolescents, or college students. It is most essential,
therefore, that reliability and validity of tests for adults be determined quite
specifically as to population samples and types of occupation.11


Evaluation of Group Scales 


COMPARISON WITH INDIVIDUAL SCALES. Group scales were developed to 
permit the testing of large numbers of persons at one time. On the whole, 
therefore, they are not so useful as are individual scales in studying an 
individual case. For when a group scale is used, it is not possible to
observe a person's approach to the solution of problems, nor his behavior
under success and failure. Nor is it possible to evaluate the qualitative 
characteristics of his responses, since group scales are scored quite rigidly. 
Furthermore, it is difficult—in fact, practically impossible—to know 
whether an individual is exerting his maximum effort when taking a 
group examination. It is possible to report the test results only in terms of 
numerical indexes (plus profiles, at times), whereas during an individual 
examination the psychologist is able to make behavioral and qualitative 
observations of considerable value. 

Practically all group scales below college level have been validated 
against individual scales—especially the Stanford-Binet—as one of the 
principal criteria. This fact in itself is a recognition of the merit of the 
individual scale, the quality of which the group scale is trying to approach 
as closely as possible. Other criteria of validity are the familiar ones dis- 
cussed in this and earlier chapters. 


In discussing the definitions and analyses of intelligence, it was stated
that one deficiency of all tests is that they do not measure the creative
aspects of intelligence; nor do they directly measure the insights that come
from experience (“wisdom,” “judgment”), or productive thinking, or the
intellectual originality of an individual. This deficiency is more marked
in group than in individual scales, because of the rigidity of scoring the
former.


THEORETICAL AND STATISTICAL BASES. Most of the earlier group scales
are based, implicitly at least, upon the general-factor theory of
intelligence. A number of the more recent instruments are also based upon
this theory, it appears, even though they yield separate scores of verbal
and quantitative aspects of mental ability. Several scales are based upon
the group-factor theory. This latter type will be described and discussed
more explicitly in the next chapter.


11 In this connection, the student should consult various issues of Personnel Psychology
for information on validity and normative data as they relate to these tests.

Since many group tests, of varying quality, have been published, it is
essential that prospective users examine the manuals closely to determine
which of these satisfy the standards that should be demanded of them.
The reader is already familiar with the standards and methods of
establishing reliability and validity. These should be rigorously applied to
group tests. In this connection it is essential that the manual state which
method was used in determining reliability, especially if the speed factor
seems to be a significant one.

Since group tests for children and adolescents are used primarily to
assist in dealing with educational problems, it is essential that the scale's
predictive efficiency, with regard to schoolwork and progress, be reported
as one criterion of validity. Group tests devised primarily for use with
adolescents and adults in occupational selection should also provide
specific information regarding their predictive validity.

Scores on a scale as a whole are more reliable and more valid than
subtest scores. A distinction should be made, therefore, between subtest
reliability and validity, on the one hand, and total-scale reliability and
validity, on the other. This distinction is especially pertinent when a
scale's subtest scores are to be used for differentiating and diagnostic
purposes.
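
One common way to see why a total score is more reliable than a short subtest is the Spearman-Brown formula, which projects the reliability of a test lengthened by a given factor. The sketch below is an illustration only. It assumes that the added parts are comparable to the original part (an assumption that real subtests satisfy only approximately), and the subtest reliability of .70 is a hypothetical figure, not one taken from any manual.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability of a test lengthened by `length_factor`.

    Assumes the added material is comparable to the original
    (the Spearman-Brown prophecy formula).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


subtest_reliability = 0.70  # hypothetical value for a single short subtest

# Compare one subtest with scales two and five times as long.
for k in (1, 2, 5):
    projected = spearman_brown(subtest_reliability, k)
    print(f"length x{k}: projected reliability = {projected:.3f}")
```

The projection rises with length but with diminishing returns, which is one reason a diagnostic interpretation resting on a single brief subtest calls for more caution than one resting on the total scale.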

The manual should give not only the size of the standardization
population sample, but the characteristics of that sample should be specified,
such as geographic and socioeconomic distributions, range of ages, range
of ability levels, range of school levels, and sex distribution. Similar
pertinent facts should be provided with tests designed for use with adults.

Criteria of Evaluation. In evaluating a group scale with a view 
to its probable usefulness in a given practical situation or in the study 
of a theoretical problem, it is customary to use the following criteria. 


1. It must be sufficiently valid and reliable.

2. The range of norms must be adequate for the group for which the
scale is devised.

3. The item difficulty in each subtest must be of sufficient range to
differentiate among the various levels of ability. Individuals at the lowest and
highest levels should be able to obtain scores that represent their levels.

4. In general, the range of ability to be tested (ages, school grades,
occupations) should be restricted rather than all-inclusive. If the range is
restricted, a given number of items and a given length of time can be used
for a more thorough and accurate examination than if a scale of the same
length were employed to cover a wider range. In the latter instance, the
test items would have to be spread more thinly.

5. Length of the scale must be adequate. In time required, group scales 
vary from about one-half hour to three hours, depending upon levels for 
which they are intended. The great majority of scales require one and
one-half hours or less. Increase in length, to an optimal point, adds to validity
and reliability, since errors of measurement are decreased (better sampling) 
as length is increased to an optimal point. Judging from current practices, 
based upon experiment, optimal lengths appear to be about a half hour 
at the level of kindergarten and primary grades, about forty-five minutes at 
the level of elementary grades, and up to about an hour, or an hour and a 
half, at higher levels.

6. Simplicity of responses is frequently regarded as an asset in group
tests. For some purposes, when group trends are sought rather than
individual performance, this is an asset simply because scoring is facilitated.
But, as already pointed out, such simplicity and consequent rigidity may 
limit the value of tests when evaluation of an individual's responses is 
desired. 

7. Simplicity of scoring is also frequently considered to be an asset, 
since it is actually a result of simplicity of responses. The same comments 
apply here as above. 

8. Ease of administering a group scale is desirable. Frequently, group 
scales have to be given by relatively inexperienced persons; it should, there- 
fore, be possible to train them in a brief time to administer the scale ac- 
curately and with precision. Also, simplicity of instructions and procedures 
in giving an examination to a group reduces the possibility of confusion and 
misunderstanding on the part of individuals in the group. 

9. The examiner's manual should be clear and complete in respect to 
standardization procedures and results, nature of the content, directions 
for administering and scoring, norms, and interpretation of results. 

10. The content of the tests should be interesting to the groups for whom 
intended. 

11. The content of the tests should be appropriate to the subjects being
examined. That is to say, the psychologist must determine whether or not,
in a given instance, it is desirable to use a scale that is entirely verbal, or
entirely nonverbal, or verbal and quantitative, or mixed. His choice of
scale will depend upon who is to be tested and the purpose for which the
test is being given.


Uses of Group Scales 


Without going into details, the ways in which group scales have
been used will be given.


In schools, they have been used for purposes of general survey, ability
classification of pupils, and guidance. Under general survey, studies have
been made of the range and distribution of mental ability; age and grade 
overlapping of ability; differences among pupils in various schools within 
the same community; differences among pupils in different school systems; 
differences among pupils in the several high school curricula; the effects of 
different methods of instruction upon pupils at the several levels of 
ability; relations between intelligence-test ratings and school achievement 
in general, and in specific school subjects; and comparisons of city, town, 
and rural children. 

In classifying pupils according to ability level for the purpose of dif- 
ferentiated instruction, a test of mental ability is, of course, basic, though 
it should not be the only criterion. 

Since relatively few schools include a qualified psychological examiner 
on their staffs, and since extensive individual examination is costly and
time-consuming, group scales are being used for most guidance purposes. 
However, in view of the fact that group-test ratings may indicate only the 
approximate level of an individual's mental ability, they must be used in 
conjunction with other available evidence obtained from school records, 
teachers’ reports, objective achievement tests, and interviews. But psycho- 
logical-test ratings, correctly obtained and interpreted, tell us much more 
about a pupil’s mental level and organization of abilities than could be 
ascertained without their use. 

Group tests have been applied extensively to a large number of
theoretical and practical problems of psychological, educational, and
sociological significance, such as individual differences in relation to sex,
racial and national membership; mental levels and characteristics of
special groups, such as the mentally deficient, the gifted, and the
delinquent; employee selection for jobs requiring different levels of ability;
family similarities and the inheritance of intelligence; effects of changed
environment upon mental level; the nature and course of mental
development; the nature and organization of intelligence; constancy of the IQ
and prediction of later ability; and problems of theory and technique,
such as the relationship between “speed” and “power” as aspects of
intelligence. Then, of course, there was the vast use of group scales in the
armed forces during the World Wars for screening and classification of
personnel.

The foregoing enumeration is not complete; but it suffices to show the
wide range of application of group tests of mental ability; and it helps to
explain why tests should be under continual scrutiny in an effort to
increase their validity and reliability.


MULTIFACTOR TEST BATTERIES


Characteristics 


Multiple-factor batteries are those scales that are intended for
the independent measurement of each of several kinds of mental
operations and that provide a separate score for each.1 Ordinarily, a single
index (IQ, MA, Z-score, etc.) is not obtained for the entire battery of tests.
The kinds of mental abilities tested are described in the following pages.2

These instruments stand in contrast to other tests of mental abilities,
such as the Stanford-Binet, the Wechsler scales, the Kuhlmann-Anderson,
the Henmon-Nelson, and the Lorge-Thorndike, which provide a single
index and whose items and subtests are so selected as to yield a unitary
and internally consistent measure of intelligence. This is the case, even
though separate verbal and numerical scores or verbal and performance
scores may be derived from some of these scales, in addition to an over-all
rating. In the multifactor batteries, the objective is to obtain a number
of separate scores in order to differentiate, if possible, among the several
abilities within each individual. In terms of correlations, the authors of
over-all tests of general intelligence strive to develop items and subtests
that have significant positive intercorrelations, while authors of
multifactor batteries aim to devise parts that have as little intercorrelation as
they can achieve. Otherwise stated, multifactor tests are
intended to measure each of several pure factors or each of several
combinations of a few closely related factors, with minimal overlapping of the
abilities measured by each part of the battery. (See Chapter 7,
Definitions and Analyses of Intelligence.) From these efforts have emerged a
number of terms to designate psychological functions, such as spatial
visualization, perceptual speed, and verbal facility.


1 These are known, also, as multiple-aptitude batteries, and as differential-aptitude
tests.

2 Tests constructed to measure specific aptitudes, as in music, graphic arts, mechanical
[...] and professions, employ different types of materials in part or entirely. They are
discussed in Chapters 18 and 19.

There are two principal reasons for the growth of interest in this type
of test battery, especially since the termination of World War II in 1945.
Some psychologists were impressed by the variations within some
individuals (intraindividual variations) in their performance on the several
parts (subtests) of measures of intelligence. In other words, they were more
impressed, it seems, by the absence of very high correlations among some
subtest scores than by the fact that intertest coefficients have always been
found to be positive and at a quite significant level, statistically and
psychologically. A second reason for interest in multifactor batteries is the
desire of the educational and vocational counselor, and of the personnel
psychologist, to have tests of this type for use in counseling students and
in selecting personnel in business, industry, government, and the military
forces.

The rationale of multifactor-test batteries, then, is this: since most of
them aim to measure relatively independent factors, each part of the
battery (each type of test) can provide its own norms and can be validated
for each of a variety of occupations and curricula. Thus, it is maintained,
an individual's potentiality in a number of academic and vocational areas
can be analyzed and appraised by means of a series of relatively brief
tests. The merit of this view rests, of course, upon the ability of each of
the several types of test materials in the battery to differentiate
significantly among capacities required for successful performance in
different occupations or courses of study.

Several of the multifactor batteries will be described and evaluated. 


Representative Batteries 


TESTS OF PRIMARY MENTAL ABILITIES. The Thurstones have
published a series of multifactor tests originally known as The Chicago
Tests of Primary Mental Abilities (published by the American Council on
Education, 1941-1947). Later editions are named the SRA Primary
Mental Abilities.3 Of the latter, there are three levels: ages 5-7, 7-11, and
11-17. For the most part, but not entirely, we shall deal with the scale for
ages 11-17. All scales in this series are based upon the same psychological
and statistical principles, that is, the group-factor theory of mental ability,
the factors having been inferred through the use of factor analysis. At the
11-17 age level, the following factors are tested:


3 The initials “SRA” stand for Science Research Associates, Inc., of Chicago, the
publishers at present.


Space

Look at the row of figures below. The first figure is like the
letter F. All the other figures are like the first but they have
been turned in different directions.

[Row of figures: the letter F turned in different directions]

Now look at the next row of figures. The first one looks like
the letter F. But none of the other figures looks like an F
even if they were turned right side up. They are all made
backward.

[Row of figures: the letter F followed by reversed figures]

Some of the figures in the next row are like the first figure.
Some are made backward. The figures like the first figure
are marked.

Reasoning

Study the series of letters below. What letter should come
next?

[Letter series]

The next letter in this series should be a. The letter a has
been marked in the answer row at the right.

Now study the next series of letters and decide what the next
letter should be. Mark the letter in the answer row at the
right.

c a d a e a f a

You should have marked the letter g.


FIG. 17.1. Practice items from the SRA Tests of Primary
Mental Abilities. Science Research Associates, by permission.

Verbal meaning. This factor is measured by the familiar synonyms type of
item; to test "ability to understand ideas expressed in words."

Space (spatial perception). Test items use various simple designs and geo- 
metric figures, differently rotated, to test perception of the relations of an 
arrangement of objects in space; to test "ability to visualize objects in two- 
or three-dimensional space." " 

Reasoning. Each item consists of a series of letters arranged according 
to a pattern, or rule, which has to be recognized; to test "ability to solve 
logical problems." 

Number. This test consists of simple, two-column problems in addition,
to which answers have been given; some are right, others are wrong; the 
answers are to be marked accordingly; to test "ability to work with figures— 
to handle simple quantitative problems rapidly and accurately." 

Word fluency. The test requires the writing of as many words as possible,
beginning with a given letter, in a specified time; to measure “the ability
to produce words easily.”

At the lower levels, three other types of tests are also used. These are
perceptual speed, ability to recognize likenesses and differences between ob-
jects and symbols; quantitative, ability to understand the meaning of num-
bers and to recognize quantitative differences; motor, ability to coordinate
eye-and-hand movement.


Since this battery is intended to differentiate among the several kinds
of mental operations, the first question to ask is this: how low are the
intercorrelations among the parts? Table 17.1 provides an answer, as re-
ported in the 1958 Manual.


TABLE 17.1

MEDIAN INTERCORRELATIONS OF PMA SUBTESTS: SIX GROUPS
(EACH SUBTEST WAS CORRELATED WITH THE OTHER FOUR.)

Subtests              Range of medians
Verbal meaning        .24-.50
Space                 .13-.34
Reasoning             .34-.50
Number                .17-.35
Word fluency          .13-.37


[...] five are in the .30s, two in the .20s, two [...]. The two
lowest, .17 and .13, both in[...], which is not surprising, considering
the nature [...]. [...] these coefficients are low; thus, they lend
[...] rationale of this battery, although abilities re-
quired by the five subtests are not independent of one another, since six
of the ten coefficients are above .30 (to take an arbitrary cut-off value). 
From these data, however, we may conclude that the PMA tests have a 
reasonable degree of factorial validity. Whether the mental operations 
involved in these tests are significant aspects of the complex mental 
processes of intelligence, or whether these tests have significant predictive 
validity are other questions. 
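
The check described above, asking how low the intercorrelations among the parts are, can be sketched in a few lines. This is a minimal illustration only. The subtest labels follow the text, but the correlation values and the function names are hypothetical, invented for the example and not taken from the PMA manual. For each subtest the sketch reports the lowest and highest correlation with the other four, and it counts how many of the ten coefficients exceed an arbitrary cut-off value.

```python
from itertools import combinations

# Labels follow the text: Verbal meaning, Space, Reasoning, Number, Word fluency.
SUBTESTS = ["V", "S", "R", "N", "W"]

# HYPOTHETICAL intercorrelations, invented for illustration (not from the manual).
PAIR_R = {
    ("V", "S"): 0.20, ("V", "R"): 0.45, ("V", "N"): 0.32, ("V", "W"): 0.36,
    ("S", "R"): 0.33, ("S", "N"): 0.15, ("S", "W"): 0.12,
    ("R", "N"): 0.38, ("R", "W"): 0.31,
    ("N", "W"): 0.25,
}


def r(a, b):
    """Look up the correlation between two subtests, in either order."""
    return PAIR_R[(a, b)] if (a, b) in PAIR_R else PAIR_R[(b, a)]


def range_with_others(subtest):
    """Lowest and highest correlation of one subtest with the other four."""
    values = [r(subtest, other) for other in SUBTESTS if other != subtest]
    return min(values), max(values)


def count_above(cutoff):
    """Number of the ten pairwise coefficients that exceed the cut-off."""
    return sum(1 for a, b in combinations(SUBTESTS, 2) if r(a, b) > cutoff)


if __name__ == "__main__":
    for s in SUBTESTS:
        low, high = range_with_others(s)
        print(f"{s}: {low:.2f}-{high:.2f}")
    print("coefficients above .30:", count_above(0.30))
```

With five subtests there are ten distinct pairs, which is why the text speaks of ten coefficients; the cut-off of .30 is as arbitrary in the sketch as it is in the discussion above.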

Validity of this battery was studied by the usual methods: correlations
with other tests of intelligence, tests of specific aptitudes, educational
achievement tests, and school grades, with the over-all results shown in
Table 17.2.


TABLE 17.2

PMA VALIDITY CORRELATIONS

Criteria                                 Correlations
Otis Test                                .71 (multiple R with subtests)
Kuhlmann-Anderson                        .63 (multiple R with subtests)
GATB*                                    .16-.77 (range of highest rs with
                                           subtests)
ITED†                                    .18-.73 (with subtests)
Grade averages (9th grade total grade    .27-.55 (with subtests)
  averages)
Grade averages (total averages in 9th    .29-.60 (with subtests)
  and 10th grades)
Grades in high school courses            .16-.51 (with subtests)

* General Aptitude Test Battery, U.S. Employment Service. Described later in
this chapter.
† Iowa Test of Educational Development.


Since, for educational purposes, these correlations are, on the whole,
no higher and often lower than those found with tests of general mental
ability, the advantages of using this instrument are not apparent. This
reservation is especially relevant because the authors of the PMA battery
conclude that the tests of verbal meaning and of reasoning yield the most
satisfactory correlation coefficients with criteria of school success. These
two subtests are similar to those included in scales designed to measure
general mental ability. Furthermore, tests of general ability include items
which, in complexity and structure, approximate better than do the PMA
the kinds of mental activity actually required in school work.4

4 Reliability data of this and the other multifactor tests are presented in a single table
later in this chapter.
THE DIFFERENTIAL APTITUDE TESTS. This battery of tests is a more
ambitious instrument. Its statistical data on standardization and on analy-
sis of later studies are exceptionally thorough and [...]. Al-
though the tests are intended for use in grades 8 through 12, in educa-
tional and vocational guidance, they may also be used with [...]
adults. Several guiding principles were followed in developing the several
parts.

All eight tests of the battery were standardized on the same population.
Thus the norms and percentile values for each test have the same
significance as those for all the other tests in the battery; for the ranges of
age, aptitude, school grade, and nonintellective personality factors were
constant in the standardization process. Psychological profiles, therefore,
are more meaningful for interpretation of differences within an indi-
vidual. The published norms are based upon a population sample of
47,000 boys and girls in grades 8 through 12, from communities through-
out the country. Separate batteries and norms are available for boys and
girls.

Each of the abilities represented should be independently tested. 

Each test should measure level of ability; that is, power rather than 
speed of performance, with the exception of the part testing clerical 
speed and accuracy. 

The battery of tests should yield a profile in terms of percentile ranks, 
all of which will be comparable for a given individual since they have 
been derived from the same population sample (see Fig. 17.4). 

Usefulness in practical counseling is of primary concern. Although
theoretical concepts and findings, including factor analysis, were taken
into consideration, the parts of the battery are not intended to be "fac-
torially pure.”


The battery includes the following eight tests. 


Fig. 17.2. Space relations item from the Differential Aptitude Tests. The exami-
nee selects the three-dimensional figures that can be made by folding the two-
dimensional figure at the left. The answer is figures A, C, and E. The Psycho-
logical Corporation. Reproduced by permission.

Each row consists of four figures called Problem Figures and
five Answer Figures. The four Problem Figures make a series.
You are to find out which one of the Answer Figures would be
the next, or the fifth one in the series.

[Problem Figures and Answer Figures A through E]

Abstract reasoning practice item. Study the position of the
black dot. Note that it keeps moving around the square clock-
wise: upper left corner, upper right corner, lower right corner,
lower left corner. In what position will it be seen next? It will
come back to the upper left corner. Therefore, B is the answer,
and you would mark your Answer Sheet like this:

[sample answer-sheet marking]

[Test items and sample of answer sheet]

Fig. 17.3. Clerical speed and accuracy items from the Differen-
tial Aptitude Tests. In each row of the Sample Answer Sheet, the
examinee marks the combination that matches the underlined
combination in the corresponding left-hand row. The Psycho-
logical Corporation, by permission.

Verbal reasoning. Verbal analogies of the familiar type are used to measure
ability with more or less complex verbal concepts and relationships. (Since
this test includes only a limited area of verbal reasoning, a more appro-
priate designation would be "verbal analogies.”)5

Numerical ability. Numerical relationships and facility with number
concepts are tested. These are essentially computational, rather than prob-
lem solving. Some of the items test only proficiency in the four fundamental
processes; others require understanding of quantitative concepts and rela-
tionships.

Abstract reasoning. Ability to reason with nonverbal materials is tested.
The series of designs presented in each item requires understanding of an
operating principle in producing changes in members of the series (see
Fig. 17.8).

5 Compare with the Miller Analogies Test and the Terman Concept Mastery Test, in
Chapter 16.

Space relations. Ability in spatial visualization is tested by presenting two-
dimensional geometric figures (variously shaded). These are imaginally
manipulated, each to form a three-dimensional figure. The purpose is to
test ability to visualize constructed figures, variously rotated (see Fig. 17.2).

Mechanical reasoning. Mechanical comprehension is tested by means of a 
series of pictorially presented situations involving mechanical and scientific 
principles. Each picture is accompanied by a brief and simply worded 
question about the principle involved (see Chapter 18, on tests of mechani- 
cal aptitude). 

Clerical speed and accuracy. Speed and accuracy of responses to letter-
and-number combinations are measured. This test requires the matching
of various combinations, emphasizing perception of detail and rate of re-
sponse. The items are intended to "approximate the elements involved in
many clerical jobs."

Language usage. Part I is a spelling test. Some words are correctly spelled,
while others are misspelled. The testee indicates, for each word, whether it
is correctly or incorrectly spelled.

Language usage. Part II consists of sentences in which the examinee is 
required to distinguish faulty from correct grammar, punctuation, and 
word usage. Parts I and II are included in the battery as basic skills 
necessary in many vocations. 


Validity of the DAT has been studied in three ways: prediction of course
grades, of achievement-test results, and of vocational and educational suc-
cess. The manual of this battery contains hundreds of correlation coeffi-
cients in these three categories. Table 17.3 presents, in summary form,
some of the major over-all findings when course averages in grades 8-12
were used as criteria; that is, “concurrent validity.”

DAT scores were correlated also with course averages in each of several
subjects of study, obtained from six months to three and one-half years
after the test scores were obtained. These coefficients of predictive validity
are of the same general order as those shown in Table 17.3. When DAT
scores were correlated with course averages of college freshmen, however,
the results, as reported in the manual, were not promising. It appears,
therefore, that the DAT should not displace the scales intended spe-
cifically to select college freshmen, discussed in Chapter 16.

When DAT scores were correlated with scores on standardized achieve-
ment tests, also as indexes of predictive validity, the results generally are
somewhat better than those found with course grades. This is particularly
true of Verbal Reasoning and Numerical Ability, and to a lesser degree of
Abstract Reasoning and Language II. As a result of their validation
studies, the authors of the DAT justifiably conclude that the tests of
Verbal Reasoning, Numerical Ability, and Abstract Reasoning measure
functions associated with general intelligence and, it should be added, are
most useful as measures of scholastic ability. For example, the authors of
the DAT report that the sums of scores on Verbal Reasoning and Nu-
merical Ability correlate with later schoolwork from .70 to .86. These are
high and speak well for the tests. Thus, although the more recent tests
are technically sounder than some of their predecessors, we find
that verbal, numerical, and other types of test items requiring abstract
reasoning, of the kinds long since in use in psychological testing, are prov-
ing to be most valuable in measuring general ability and educational
promise.

TABLE 17.3

VALIDITY CORRELATIONS FOUND WITH THE DAT
AND COURSE GRADES: ALL PARTS
(BOYS AND GIRLS)

                      Range of median r
                      values for each of
School subject        the separate parts   Lowest median             Highest median
English               .21-.53              Mechanical reasoning      Language, Part II
Mathematics           .16-.52              Clerical                  Number ability
Science               .24-.55              Clerical                  Verbal reasoning
Social Studies and    .21-.52              Clerical and mechanical   Verbal reasoning
  History                                    reasoning

6 Students should consult and analyze these tables.

Fig. 17.4. Differential Aptitude Tests. A profile of scores. The Psychological
Corporation, by permission.
The manual of the DAT, unlike that of most other tests, provides data
on a follow-up study of 1700 individuals who were tested as high school
juniors and seniors. Later information obtained from them concerned
their educational and vocational careers since leaving high school, the
purpose being, again, to study the predictive value of the
battery. On the whole, the battery does not differentiate significantly
among individuals entering different types of higher education or dif-
ferent occupations, except on the basis of general level of scores.
Students in premedical, science, and engineering courses have the highest
average scores, with liberal arts students next in order. But there is con-
siderable overlapping of scores among these groups and variation in
scores among the members of each group. Nor is there a clear pattern of
scores to characterize each of these groups. It is significant to note, also,
that the above-mentioned groups, and salesmen, beauty operators, reg-
istered nurses, stenographers, and secretaries, had higher average scores
in Space Relations and Mechanical Reasoning than did mechanics in the
electrical and building trades. This state of affairs is not necessarily at-
tributable to defects in the tests, since many years of research and ex-
perience with psychological and educational tests have repeatedly yielded
the same general results, because human abilities are significantly cor-
related, even if not uniform. Differences in curricular and occupational
preferences and performance must, in most cases, be attributed to
sensory traits, motor skills, and nonintellectual personality traits, as well
as to differences in general mental ability.
There are individuals, of course, whose profiles show marked dis-
crepancies. These discrepancies can have various causes. In using the
results of a multifactor test in educational and vocational guidance, there-
fore, the counselor must take the individual "case approach." This ap-
proach depends upon adequate comprehension of the battery’s statistical
evidence and psychological rationale, and upon knowledge of develop-
ment not only of mental abilities, but also of interests, values, and other
personality traits.7


7 The authors of the DAT provide, in this connection, a useful casebook. See refer-
ence 1.

FLANAGAN APTITUDE CLASSIFICATION TESTS. An elaborate approach to
multifactor testing is found in this battery (FACT, for short), intended
primarily for vocational counseling and employee selection. Norms were 
established for grades 9 to 12. Flanagan lists twenty-one relatively inde- 
pendent “critical job elements" which, in different combinations, are said 
to contribute to success in different types of occupations. How are these 
job elements discerned? Flanagan states: "The first step . . . is to develop 
a comprehensive list of the critical behaviors involved in the job or jobs 
being studied. These critical behaviors are obtained by determining sys- 
tematically which behaviors really make a difference with respect to on- 
the-job success and failure. The critical behaviors are then classified into 
job elements in terms of initial hypotheses regarding the precise nature 
of the aptitudes involved. The next step is to test the hypotheses that
specific types of variation in job performance are correlated with varia-
tion on the related aptitude" (9, p. 2). "Each job element has been so
defined as to be general in the sense that it is included in a number of
occupations, but specific or relatively unique in the sense that it measures
something different than the other job elements included in the list”
(23, p. 67).

Paper-and-pencil tests have been prepared for nineteen of the twenty-
one elements; the remaining two, “Carving” and “Tapping,” are per-
formance tests. These twenty-one elements and their corresponding tests
constitute a rather heterogeneous assortment, involving quite different
functions. Sensory, motor, memory, verbal, numerical, general vocabulary,
logical reasoning, and English usage functions are among them. The following
are several of the elements, with the rationale given by the test's author.


Inspection: measures ability to spot flaws or imperfections quickly and
accurately in a series of drawings of objects. (A test of visual perception of
detail.)

Assembly: measures ability to visualize how an object would look when
a number of given parts are put together. (A test of three-dimensional space
relationships.)

Judgment and Comprehension: measures ability to read with understand-
ing and to use good judgment in practical situations. (A test of paragraph
meaning.)

Ingenuity: measures creativity or inventiveness in devising ingenious
procedures, equipment, or presentations. (A reasoning test, based on a
stated problem.)

Alertness: measures ability to size up a situation and notice that a dan-
gerous situation exists, or that some specific action is needed. (A test of
perception of details in a picture and their interrelations.)

Other types of FACT tests are briefly as follows:

Coding: speed and accuracy in coding typical office information. 

Memory: recall of codes learned in the coding test. 




Precision: speed and accuracy in making small finger movements. 

Scales: speed and accuracy in reading scales, graphs, and charts. 

Coordination: ability to coordinate hand-and-arm movements. 

Arithmetic: proficiency in the four fundamental processes. 

Patterns: ability to reproduce simple pattern outlines. 

Components: ability to identify important component parts in line draw- 
ings and blueprint sketches. 

Tables: reading two types of tables: one using numbers, the other using 
words and letters of the alphabet. 

Mechanics: ability to understand mechanical principles. 

Expression: ability to communicate ideas in writing and talking. 


Although each of the twenty-one tests is scored separately, the scores 
of various tests are combined to predict success in a variety of specific 
occupations, ranging from the professional to the farmer, policeman, and 
skilled worker. Since the separate tests represent such a diversity of func- 
tions and have been shown to be only weakly interrelated, the practice of 
combining their scores is seriously questioned. 

As for the validity of the tests of this battery, it appears that their au-
thor employed construct validity, to begin with, although some psy-
chologists would consider the initial step, in the quotation above, as being
similar to face validity. Flanagan regards the job-element approach as a
method intermediate between analysis into primary (or pure) factors, on
the one hand, and the job-sample method, whereby the essential elements
of the real job are simulated in the test, on the other. In addition, how-
ever, statistical data are provided on validity.

Since the elements are regarded as relatively independent, their inter-
correlations should be low. In a ninth-grade group the range was from
-.02 to +.57, with a median of +.20. In a twelfth-grade group the range
was from -.03 to +.62, with a median of +.31. This aspect of internal
validity, then, is reasonably well satisfied. Table 17.4 presents validity
correlations found with schoolwork (concurrent validity). The data in
this table are not such as to encourage the use of the individual tests for
educational guidance of ninth-grade or twelfth-grade pupils. Although the
combined scores of the nineteen parts of the battery yield quite significant
multiple-correlation coefficients, the testing time required to obtain these
scores is more than seven hours. The same results have been achieved with
fewer tests requiring much less time.
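
Why a set of weakly intercorrelated tests can yield a multiple correlation above any single validity coefficient may be seen from the standard two-predictor formula, R squared = (r1^2 + r2^2 - 2 r1 r2 r12) / (1 - r12^2), where r1 and r2 are the correlations of the two tests with the criterion and r12 is their intercorrelation. The sketch below is an illustration under stated assumptions: the function name and the numerical values are hypothetical, and the nineteen-test case reported for this battery requires the full matrix form, not this two-predictor shortcut.

```python
from math import sqrt


def multiple_r_two_predictors(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r1, r2 : correlation of each predictor with the criterion
    r12    : correlation between the two predictors
    The three values must form a consistent set of correlations.
    """
    if abs(r12) >= 1.0:
        raise ValueError("r12 must lie strictly between -1 and 1")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return sqrt(r_squared)


if __name__ == "__main__":
    # HYPOTHETICAL values: two tests, each correlating .30 with the criterion.
    for r12 in (0.0, 0.2, 0.8):
        big_r = multiple_r_two_predictors(0.30, 0.30, r12)
        print(f"intercorrelation {r12:.1f} -> multiple R {big_r:.3f}")
```

The lower the intercorrelation of the two tests, the more the second test adds to the first. This is the statistical reason for wanting low intercorrelations among the parts of a battery, and also the reason a long battery of modestly valid parts can reach a respectable multiple R.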


After a five-year period, follow-up studies of college performance of
students in a variety of fields of study, who were tested in high school
(predictive validity), yielded correlations ranging from .04 to .65, with a
median of .39. The lowest coefficient was for "clergyman, missionary, social
worker,” while the highest was for “social scientist” (which includes
six vocations; for example, psychologist, lawyer, historian). In addition, a
fairly large number of correlations are reported between individual tests
and ratings in a number of occupations. Although there are occasional
coefficients in the .50s and .60s, most are considerably lower; in some in-
stances they are extremely low, approximating zero. In view of the correla-
tions with achievement in school courses and of those with later educa-
tional and occupational ratings, the vocational-guidance counselor will
have to exercise considerable caution in the interpretation and applica-
tion of scores obtained with the tests of this battery.


TABLE 17.4

VALIDITY CORRELATIONS OF THE FACT
(INDIVIDUAL TESTS)

                             Median
Criteria                     correlations   Range of correlations
9th grade English            .28            -.05 to .48 (R = .63)*
9th grade social studies     .25            -.03 to .41 (R = .58)
9th grade science            .27             .01 to .42 (R = .60)
9th grade mathematics        .28            -.04 to .45 (R = .62)
12th grade English           .34            -.02 to .61 (R = .74)
12th grade social studies    .32             .03 to .57 (R = .67)
12th grade science           .28             .03 to .52 (R = .62)
12th grade mathematics       .31             .02 to .52 (R = .61)

* R in each case is the multiple correlation of the entire set of nineteen tests
with the criterion.


8 Guidance counselors and other advanced students of tests and testing should make a
thorough study of the latest Technical Report of FACT.

GENERAL APTITUDE TEST BATTERY. This battery (abbreviated GATB)
was first published in 1947 by the U.S. Employment Service, to be used
by counselors in the State Employment Services. It is based upon the
assumption “. . . that a large variety of tests can be boiled down to sev-
eral factors and that a large variety of occupations can also be clustered
into groups according to similarities in the abilities required. This makes
it feasible to test all of a person's vocational abilities in one sitting and to
interpret his scores in terms of a wide range of occupations" (5, p. 22).
The battery has twelve tests that yield nine "aptitude scores," and requires
a little more than two hours to administer. These nine, which follow,
were identified by factor analysis.

Intelligence (G): measured by the combined scores on three tests: three-
dimensional space, vocabulary, and arithmetical reasoning.

Verbal Aptitude (V): the familiar test of synonyms and antonyms.

Numerical Aptitude (N): includes computation and arithmetical reasoning 
of the usual sort. 

Spatial Perception (S): tests ability to visualize three-dimensional objects 
when they are presented in two dimensions. 

Form Perception (P): tests ability to match drawings of tools in one part, 
and of geometric forms in another. 

Clerical Perception (Q): requires the matching of names. 

Motor Coordination (K): measures coordination by requiring individual 
to make specified pencil marks in a series of squares. 


Finger Dexterity (F): tests proficiency in assembling and disassembling 
rivets and washers. 


Manual Dexterity (M): measures hand dexterity by using both hands, and 
then the preferred hand, in placing pegs in a pegboard. 


This battery is intended for use in counseling individuals who are look-
ing for occupations or who want assistance in the choice of vocational
training. An individual's scores on each of the nine separate parts are
matched against the minimum scores (cut-off scores) found to be desirable
for each group of occupations. On the basis of standardization statistics,
Occupational Ability Patterns (OAP) were established. Each pattern con-
sists of three "key aptitudes" required by a "family of similar occupa-
tions." One pattern, for example, consists of Intelligence (G), Numerical
Ability (N), and Spatial Aptitude (S). Occupations covered by this pattern
are those in "Laboratory Science Work and Engineering and Related
Work." Basically, the inclusion of a particular "key aptitude" in any pat-
tern and “family” of occupations depends primarily upon its correlation
with performance in those occupations, using as criteria the commonly
employed ratings on the job (for example, production records, earnings,
work samples, supervisors’ ratings).
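
The matching of an individual's scores against cut-off scores can be pictured with a short sketch. Everything specific in it is an assumption made for illustration: the pattern names, the minimum scores, and the function name are hypothetical and are not taken from the GATB materials. The sketch only shows the logic described above, namely that a pattern is met when every one of its three key aptitudes reaches its minimum.

```python
from typing import Dict, List

# HYPOTHETICAL patterns: three key aptitudes each, with invented minimum scores.
PATTERNS: Dict[str, Dict[str, int]] = {
    "Pattern A (G, N, S)": {"G": 125, "N": 115, "S": 115},
    "Pattern B (Q, K, F)": {"Q": 100, "K": 95, "F": 90},
}


def qualifying_patterns(scores: Dict[str, int],
                        patterns: Dict[str, Dict[str, int]] = PATTERNS) -> List[str]:
    """Return the patterns whose every cut-off score is met or exceeded.

    A missing aptitude score is treated as failing the cut-off.
    """
    met = []
    for name, cutoffs in patterns.items():
        if all(scores.get(apt, float("-inf")) >= minimum
               for apt, minimum in cutoffs.items()):
            met.append(name)
    return met


if __name__ == "__main__":
    # HYPOTHETICAL examinee: high G and N, but S falls below the cut-off.
    examinee = {"G": 130, "N": 120, "S": 110, "Q": 105, "K": 100, "F": 95}
    print(qualifying_patterns(examinee))
```

Because every key aptitude must reach its minimum, a high score on one aptitude cannot make up for a low score on another. That is a different decision rule from combining scores into a weighted sum.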

Regarding validity, the results of a very large number of statistical 
studies are available (5). The findings are briefly summarized below. 

Since the GATB is a set of factored tests, the nine "aptitudes" should
show low intercorrelations. They do this fairly well, the coefficients rang-
ing from .03 to .81 for several groups, while the medians were from .29 to
.34. It is to be expected, of course, that high correlations were consist-
ently found between G and each of V, N, and S, since the G tests are the
same kind as those used in the other three (see above).

Tetrachoric correlations9 were calculated between aptitude-pattern
[...] of job criteria. These coefficients are, on [...] of the jobs
listed, the number of cases [...] the standard errors of the
coefficients are often [...] conclusions to be reached.

9 For this statistic, see Chapter 5. The use of this correlational method has been criti-
cized as being less appropriate than others for the problem at hand.
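
For a concrete sense of the statistic, the following sketch computes the familiar cosine-pi approximation to the tetrachoric coefficient from a fourfold table. It is an approximation only, and it is not claimed to be the procedure used in the GATB analyses. The function name and the cell counts are hypothetical.

```python
from math import cos, pi, sqrt


def tetrachoric_cos_pi(a: int, b: int, c: int, d: int) -> float:
    """Cosine-pi approximation to the tetrachoric correlation.

    Fourfold table of counts:
                      criterion high   criterion low
        test high           a                b
        test low            c                d
    a and d are the concordant cells; b and c are the discordant cells.
    """
    if min(a, b, c, d) < 0:
        raise ValueError("cell counts must be non-negative")
    if b * c == 0:
        if a * d == 0:
            raise ValueError("coefficient is undefined for this table")
        return 1.0  # no discordant cases: limiting value of the formula
    return cos(pi / (1.0 + sqrt((a * d) / (b * c))))


if __name__ == "__main__":
    # HYPOTHETICAL counts: 100 workers split at a cut-off score and at a
    # dichotomized job criterion.
    print(round(tetrachoric_cos_pi(40, 10, 10, 40), 3))
    print(round(tetrachoric_cos_pi(25, 25, 25, 25), 3))
```

With few cases in a table, moving a handful of individuals from one cell to another changes the coefficient appreciably. That is the practical meaning of a large standard error.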
GATB scores were correlated with grade averages for a fairly large
number of college students in a variety of courses of study, using both con-
current and predictive validity. On the whole, these coefficients are poor.
The highest obtained were, of course, with G; but even these were in-
ferior to correlations found with other scales, specifically devised for
prospective college students. The GATB, therefore, cannot be regarded as
a satisfactory substitute for testing such students.

Quite aside from the vast amount of careful statistical analysis of data
and its other technical aspects, merit and value of the GATB are found
in the fact that they have been, and continue to be, validated against
actual occupational criteria in a large number of specific jobs, each of
which is placed in a "family" of occupations. Since the tests are intended
to select persons for placement in groups of occupations, the parts, or
factors, to be included in the battery must be relatively small in number,
and each must provide a significant measure for a number of specific oc-
cupations. The data on this battery, now available, are sufficient to war-
rant its use in vocational counseling, provided the testing is followed by
thorough interview, since some groupings of jobs include occupations that
require abilities different from those tested by the battery, as well as re-
quiring different nonintellectual traits of personality, which can prove to
be important determinants of satisfaction or dissatisfaction, success or
failure, in a given occupation.

OTHER REPRESENTATIVE BATTERIES. The batteries thus far described
present a comprehensive view of the instruments in this category; and
they are among those that have received widespread attention. There are
also a number of others that, like those described, have been constructed
on technically sound bases. However, there is no need to present their
content, since the test materials are of the same psychological types, al-
though there are some differences in the specific items and in scoring.
Their criteria of validity and methods of estimating reliability are much
the same as those used in the other batteries. The major differences seem
to be the extent to which emphasis is placed upon "factorial purity"
(factorial validity) and the extent to which claims are made for the dif-
ferential validity of each battery. Some of the claims are moderate and
warranted; others are excessive and unwarranted, on the basis of results
found.
Among the batteries emphasizing "relatively unique" factors are the
Guilford-Zimmerman Aptitude Survey, the Holzinger-Crowder Uni-Factor
Tests, and the Multiple Aptitude Tests (by D. Segel and E. Raskin).10
[...] the hope of devising tests that would measure distinct abilities, un-

APTITUDE TESTS


Definition 


An aptitude is a combination of characteristics indicative of an
individual's capacity to acquire (with training) some specific knowledge,
skill, or set of organized responses, such as the ability to speak a language,
to become a musician, to do mechanical work. An aptitude test, therefore,
is one designed to measure a person's potential ability in an activity of a
specialized kind and within a restricted range.

Aptitude tests are to be distinguished from those of general intelligence
and from tests of skill or proficiency acquired after training or experience.
They should be distinguished, too, from educational achievement tests,
which are designed to measure an individual's quantity and quality
of learning in a specified subject of study after a period of instruc-
tion.

The reader should note that aptitude is differentiated from skill and
proficiency. Skill means the ability to perform a given act with ease and
precision. Proficiency has much the same meaning, except that it is more
comprehensive; for it includes not only skills in certain types of motor
and manual activities, but also in other types of activities as shown by the
extent of one's competence in language, bookkeeping, history, economics,
mathematics. We may speak of one’s degree of proficiency in any type of
performance. On the other hand, when we speak of an individual’s
aptitude for a given type of activity, we mean the capacity to acquire
proficiency under appropriate conditions; that is, his potentialities at
present, as revealed by his performance on selected tests that have predic-
tive value.

Furthermore, when we speak of a person's aptitude for a specified ac 
tivity, we do not make any assumptions regarding the degree to which it 
depends upon innateness or acquisition. An aptitude test is given to an 
individual in order to obtain a measure of his promise or essential teach- 
ability in a given area. Although they make no assumptions regarding 
the roles of “nature versus nurture” in this matter, clinicians and guid- 
ance counselors cannot ignore a person's past experience in evaluating his 
performance on aptitude tests. For example, one method of measuring 
mechanical aptitude is by means of a mechanical assembly test, utilizing 
various common objects such as a bicycle bell and a door lock. It is in- 
conceivable that a boy who in the past has had opportunity to manipulate 
such objects will not achieve a higher score than if he had not had such 
experience. Testing instruments measuring engineering aptitude include,
for example, tests of simple mathematical relationships, scientific
vocabulary, common scientific principles, and problems of practical
mechanical insight. Here again, an individual's performance will be
influenced by his previous experience. This aspect of aptitude testing and
interpretation will become clearer as the reader becomes acquainted with 
the nature and content of aptitude tests. 

The principles underlying aptitude tests are the same as those employed 
with tests of intelligence in respect to sampling of performance, popula- 
tion samples, and standardization techniques. Therefore, we shall not 
present the several aptitude tests in statistical detail. It will be our
purpose, rather, to describe the kinds of activities or functions most
commonly examined by available tests of this type.


Tests of Vision and Hearing 


Quite aside from the obvious desirability of having good vision
and hearing, there are numerous occupations and forms of learning in
which one or both, at a high level, are essential; thus, they are aspects,
or elements, of certain aptitudes. Sensory deficiencies, furthermore, may
adversely affect an individual's achievements in schoolwork or in his
social and emotional adjustment. In some instances, therefore, these
deficiencies might be significant in clinical [...] guidance. The
handicaps [...] defective vision are too obvious to be detailed.

As was indicated in Chapter 1, a large part of psychological
experimentation in the nineteenth and early twentieth centuries was devoted
to research on sensory acuity and discrimination. The purpose of these
studies in almost all instances, it will be recalled, was not to reveal the
extent and nature of individual differences, but to establish general
principles and laws. Although interest in this type of experimentation has
continued, there is today much more concern than heretofore with applied
experimental research for military and industrial purposes, involving
both sensory and motor aptitudes. This type of research is now called
"industrial psychology" and "engineering psychology" or, in some
instances, "human-factor analysis." The purposes of these researches are (1) to
determine which sensory and motor resources are necessary, and to what degree,
in certain specified types of occupations; and (2) to adapt designs of
equipment of all kinds (military, industrial, household) for their most
effective and facile use by persons who operate or utilize them. To achieve
these ends, knowledge of human sensory and motor abilities and their
variations is essential.
[...] are interested, also, in research on vision and hearing, insofar
as these may provide information on the [...] involved in learning to
read, in working in the graphic arts, and in undertaking certain types of
vocational education.

Tests of vision and hearing are used in the selection of industrial and
[...] personnel, in order to select those [...] visual or auditory acuity,
or of color [...], and to eliminate those below the specified [...].
[...] quality and quantity of production. Also, [...] tests of vision, as
well as those of motor skill, now have an important part in studies of
accidents and their prevention. [...]

Tests of Visual Acuity. Several ophthalmological instruments are
available for large-scale testing of the following visual characteristics:
[...] These provide measures of
satisfactory vision or deficiencies requiring referral for professional
examination.

Most persons have taken tests of visual acuity, the most familiar device 
being the crude Snellen chart. On this chart are printed rows of letters, 
varying in size, to be read by the subject. Each row and size has been 
standardized as recognizable at a specified distance by the "normal eye."
Visual acuity is expressed as a fraction. The numerator is the distance the
subject stands from the chart (usually twenty feet), and the denominator
is the "distance value" of the smallest letter that can be read by the
person being tested. Distance value of a given size is the distance at which a
letter of that size can be read by the normal eye. Thus, if the smallest
letter read by a person standing at a distance of twenty feet is read by the
normal eye at forty feet, that person's vision (in that eye) is given as 20/40.
The present Snellen chart, though still used, is a very inadequate test; 
for it will detect myopes (the near-sighted) but not the hyperopes (the 
“long-sighted” who can see distant objects with less muscle strain than 
near objects), or presbyopes (those whose eye accommodations are chang- 
ing with advancing age), or those badly handicapped by muscle im- 
balance. 
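
The arithmetic of the Snellen fraction can be restated in a brief sketch. The function name and the decimal conversion are illustrative assumptions and are not drawn from any scoring manual; the sketch only applies the rule given above, namely the subject's distance from the chart over the distance value of the smallest letter read.

```python
def snellen_acuity(test_distance_ft, distance_value_ft):
    """Return the Snellen notation and its decimal equivalent.

    test_distance_ft  -- distance the subject stands from the chart
    distance_value_ft -- distance at which the normal eye reads the
                         smallest letter the subject could read
    Both names are hypothetical labels chosen for this sketch.
    """
    if test_distance_ft <= 0 or distance_value_ft <= 0:
        raise ValueError("distances must be positive")
    # By convention the fraction is written unreduced (20/40, not 1/2).
    notation = f"{test_distance_ft}/{distance_value_ft}"
    decimal = test_distance_ft / distance_value_ft
    return notation, decimal


# The example from the text: smallest letter read at twenty feet is one
# that the normal eye reads at forty feet.
print(snellen_acuity(20, 40))
```

A decimal value below 1.0 indicates acuity poorer than that of the "normal eye" at the testing distance; as the text notes, the figure says nothing about hyperopia, presbyopia, or muscle imbalance.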

No one need labor the point that good vision is highly desirable. Hence 
in schools and industry there should be screening by means of a
dependable device to find those whose vision needs correction or whose
color deficiencies must be given consideration in their education.

It is not unexpected that the data obtained by means of various testing 
devices have been factor analyzed (44). The factors inferred from the
results are given as retinal resolution (ability to distinguish points in
the visual field; an attribute of acuity), accommodation, depth percep- 
tion, lateral and vertical muscle balance, convergence efficiency, bright- 
ness discrimination, and form perception. These characteristics of vision 
will not occasion surprise among persons familiar with physiological op- 
tics; for, as a matter of fact, they are the aspects of vision which the 
various tests and instruments were designed to measure.

Tests of Color Vision. All such tests depend upon the principle
that color-deficient persons confuse certain groups of hues,
while a person with normal color vision distinguishes among them. [...]
In one of [...] color vision will see [...], the [...]-blind eye will see
only [...], and the red-green blind eye will see neither. Another set of
charts is so devised that the person with normal color vision will see
certain numerals, whereas the color-deficient will not. These are called
"pseudoisochromatic" charts or plates.

The weakness of these tests lies in their requirement of a standard
illuminant for testing, a condition which almost never exists. For
example, in the Navy, during World War II, about 50 percent of all 
color-deficients remained undetected after one to five medical examina- 
tions, despite instructions to examiners to exercise great care with illumi- 
nation. 

More recently, however, an illuminant-stable color-vision test has been 
made available (20), designed to overcome the errors produced by varia- 
tions in illumination, color fading, and card soiling. This test will yield 
stable results under varying conditions of illumination. Two other fairly 
recent tests are designed to differentiate among the colorblind, the mod- 
erately color defective ("color-weak"), the normal, and the superior in 
color discrimination (12, 13). A differentiation finer than a twofold 
classification is desirable, since color-weak individuals can do satisfactory 
work when only gross discriminations are needed.²

Since examinations of men in the armed services have shown that
almost 10 percent of all males are color-deficient in some degree, it is
desirable to test all school children very early for this function. Deficients
could be diverted, for example, from trying to become artists, geologists, 
clothes designers, etc. It is wise, also, to use a dependable color vision 
test in personnel divisions of department stores and of some industries; 
in the stores, to avoid placing color-deficient sales persons in the wrong 
departments; in industry, to avoid placing colorblind individuals on, for 
example, radio or other wiring that requires discrimination of a color
code. For a job in which some color deficiency can be tolerated, the
demands of the job in regard to color discrimination should be deter- 
mined, as well as the candidate's degree of deficiency. 

Auditory Tests. This function is measured by means of whispered
words testing the hearing of consonants and vowels (for example,
the Andrews Whispered Speech Test); or, preferably, by means of an
audiometer. The first of these consists of numbers that are whispered at
a specified distance. An individual's score is the percentage of numbers
correctly heard, divided by the normal percentage. It is obvious that
this type of test is unreliable because of the uncontrolled variables in the
testing situation: for example, quality and intensity of the tester's voice,
acoustical properties of the room, external sounds.
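
The scoring rule for the whispered test amounts to a simple ratio, sketched below. It is assumed here that the "normal percentage" is a norm supplied by the test's standardization; the function name and the sample figures are hypothetical.

```python
def whispered_speech_score(n_correct, n_presented, normal_percent):
    """Ratio of the subject's percentage correct to the normal percentage.

    normal_percent is assumed to come from the test's norms; it is not
    a value given in the text.
    """
    if n_presented <= 0 or normal_percent <= 0:
        raise ValueError("n_presented and normal_percent must be positive")
    percent_correct = 100.0 * n_correct / n_presented
    return percent_correct / normal_percent


# Hypothetical case: 15 of 20 whispered numbers heard, norm of 90 percent.
print(round(whispered_speech_score(15, 20, 90.0), 2))
```

A ratio of 1.0 would mean performance equal to the norm. The ratio inherits every uncontrolled variable named above, which is why it cannot be more dependable than the testing situation itself.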

¹ An early and gross test of color vision was the Holmgren yarns. These require
matching small skeins of assorted colors against large sample skeins of green,
red, and rose.
² There are a number of other tests of visual acuity and of color vision, most
of which do not provide adequate standardization and validating data.

Audiometers, of which there are several types, provide much more
accurate and reliable measures of acuity, for they are unaffected by any
of the disadvantages mentioned above. The subject wears headphones
that shut out external sounds, and each ear is tested individually. Some
audiometers use pure tones as stimuli; others reproduce recorded
numbers, words, or sentences. In both types, the sound intensity of the
stimulus is gradually decreased at small uniform intervals until the
individual's minimum perceptible intensity is reached. The minimum level is
also checked by starting with a sound below the individual's threshold
and increasing it until he is just able to hear it. When pure tones are
the stimuli, different frequencies (number of cycles per second) are used
at each of the various intensity levels in order to test for differential
loss of hearing, if any exists. For example, a person's hearing may be
within the normal range in the lower frequencies but defective in the
higher.
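
The descending series and the ascending check described for the audiometer can be sketched with a simulated listener. Everything specific below is an assumption made for illustration: the decibel units, the 5-unit step, the starting level, and the listener who responds whenever the level is at or above a fixed "true" threshold. It is not a description of any particular instrument.

```python
def descending_threshold(hears, start_db, step_db, floor_db=0):
    """Lower the level in uniform steps; return the last level still heard."""
    level = start_db
    last_heard = None
    while level >= floor_db and hears(level):
        last_heard = level
        level -= step_db
    return last_heard


def ascending_threshold(hears, start_db, step_db, ceiling_db=100):
    """Raise the level from below threshold; return the first level heard."""
    level = start_db
    while level <= ceiling_db:
        if hears(level):
            return level
        level += step_db
    return None


def audiogram(true_thresholds, start_db=60, step_db=5):
    """Run both series at each frequency for a simulated listener.

    true_thresholds maps frequency (cycles per second) to the level at
    which the hypothetical listener begins to respond.
    """
    result = {}
    for freq, true_db in true_thresholds.items():
        # Idealized listener: responds whenever level >= true threshold.
        hears = lambda level, t=true_db: level >= t
        down = descending_threshold(hears, start_db, step_db)
        up = ascending_threshold(hears, 0, step_db)
        result[freq] = (down, up)
    return result


# Simulated case of differential loss: near normal at low frequencies,
# defective at the higher frequency.
for freq, (down, up) in audiogram({250: 10, 1000: 15, 4000: 45}).items():
    print(freq, down, up)
```

With a real subject the two series seldom agree exactly, so the pair of values is kept rather than collapsed into one; how they should be combined is left open here.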


In order to facilitate the testing of large numbers of school children, 
group-testing equipment is available, using both speech and pure tones. 
Of the two, the latter are superior since they test at different levels of
both frequency and intensity, whereas the former test only for intensity.

It is not necessary to dwell on the importance of hearing in school
learning (especially in reading and language) and in other forms of
activity; yet its significance is sometimes overlooked. It is customary,
therefore, in clinical cases of children and adolescents who present
problems in learning, to check on the individual's hearing and vision, as a
matter of routine.

It is obvious that tests of vision and hearing do not measure a person's
aptitude for specific types of learning and activity. For certain kinds of
learning and activity, however, a given degree of visual or auditory 
acuity is essential. In that sense, then, these devices may constitute a part 
of a battery of tests used to measure a particular aptitude. 


Motor and Manual Tests 


STRENGTH OF GRIP. One of the oldest instruments for the measurement
of individual differences in the psychological laboratory is the hand
dynamometer for measuring strength of grip. The instrument consists
of an inner and an outer handle, a dial, and a pointer. The subject grips
these handles so that the second phalanges of the fingers press against
the inner handle, while the outer handle presses against the heel of the
hand. The subject then squeezes as hard as possible. Strength of grip is
measured in kilograms. After many experiments, it appears that in
psychological work this instrument is useful principally as one device for
determining degree of handedness and rate of fatigue. Since these two traits
are involved in certain activities and occupations, they are relevant to
some aspects of aptitude testing.


REACTION TIME. This is the time interval between the onset of a
stimulus and the beginning of the person's overt intentional response.
The particular stimulus and response to it are prearranged in an
experimental situation. For example, the subject may be instructed to tap
a telegraph key immediately upon perceiving a red light, the elapsed
time between stimulus and response being electrically recorded in thou- 
sandths of a second. It is possible to devise a variety of tests, their par- 
ticular character depending upon which sensory and motor functions 
are to be measured. This type of test obviously is intended to measure 
speed of response in situations demanding immediate reaction, as in 
certain machine operations and in driving an automobile. 

MANUAL DEXTERITY. To achieve competence in activities requiring
manual dexterity, speed of gross movements of hand and arm, manual
rhythm and coordination, and finger control and coordination are
necessary in varying degrees. For each of these purposes several tests have
been devised, which vary in detail but are fundamentally alike. Gross
movements of hand and arm may be measured in terms of the speed
with which the subject picks up and places cylindrical blocks in holes in
a board. Finger dexterity and coordination, necessary in rapid and
accurate manipulation of objects, may be tested by measuring the rate at
which an individual, with fingers or tweezers, is able to pick up small
metal pins or wooden pegs, of different shapes, and place them in the
holes of a tray (see Fig. 18.1). Hand precision is measured by the accuracy
with which a metal stylus can be placed into holes of small diameter cut
in metal and electrically connected. Contacts of the stylus with rims of
holes are electrically recorded and constitute the measure of inaccuracy.
Occasionally, also, a paper-and-pencil test includes tasks designed to
measure hand precision, such as speed and accuracy of tracing a path,
speed of tapping, and placing a prescribed number of dots within a small
circle (26).


Fig. 18.1. The plier dexterity test shown here is useful in evaluating skill in
the use of small tools and, in general, in evaluating aptitudes involving finger
dexterity. The tray contains metal pegs which must be placed in the small holes
in a prescribed order. The score is based upon the time required to complete
the task. Sometimes the time required to remove the pegs also is included in the
score. (Acme Photo.)

Other tests of manual dexterity follow the same general form, but some 
are more complex. For example, the Crawford Small Parts Dexterity Test 
consists of a metal plate having two sets of thirty-six holes in each (six 
rows and six columns). One set of holes is threaded and the other is 
smooth. The testee uses forceps to place a pin in each smooth hole, then 
a collar over the pin. In the threaded holes, the testee places a small 
screw, then tightens it down with a screwdriver (Fig. 18.2). The stated 
purpose of this test is to measure a combination of perception and dex- 
terity, in terms of rate of performance. 


Fig. 18.2. Crawford Small Parts Dexterity Test. The Psychological
Corporation, by permission.


Fig. 18.3. The Stromberg Dexterity Test. The Psychological
Corporation, by permission.


The Stromberg Dexterity Test also is a device whose purpose is more
complex than the simple measurement of manual dexterity. It consists
of a tricolored formboard (six rows and nine columns), into which flat,
cylindrical disks, variously colored, are to be placed in a prescribed order
(Fig. 18.3). It appears that this test involves not only manual dexterity,
but also gross color perception and a rather elementary level of nonverbal
classification. The Minnesota Rate of Manipulation Test is similar in
conception.

Since manual dexterity scores on tests of this type are affected, in
varying degrees, by the subject's lateral dominance (the preferred use and
superior performance of one side of the body or the other), it is often
desirable to use tests of eye-and-hand dominance. This procedure is
particularly indicated if we are concerned primarily with analyzing and
understanding the person being tested, rather than with making
selections among candidates for a particular job (22).

The Purdue Pegboard measures gross movements of hands, fingers, and
arms, as well as finger-tip dexterity required in small assembly jobs. It
utilizes pins, collars, and washers that are to be assembled, using each
hand separately, then both hands in coordination.

Although the stated purpose of the Hand-Tool Dexterity Test
is to measure proficiency in the use of wrenches and screwdrivers, it is
also a measure of manual and finger dexterity. The task is to take apart
twelve fastenings, placed in an upright board, according to a prescribed
sequence, and to reassemble the nuts, bolts, and washers, with the use of
a screwdriver and several wrenches.

Coordination and rhythm of hand movements have been tested by a 
card-sorting test in which the subject drops playing cards through slots, 
using one hand at a time or both hands together. Another device is the 
two-hand coordination test in which the individual attempts to move 
both handles of a mechanism simultaneously in such a way as to keep an 
upper disk over the lower one, which moves in an unpredictable manner. 
Another two-handle device is employed in testing a subject's ability to
follow an irregular path without touching the sides. 

During World War II, numerous psychomotor tests were used by army 
and navy psychologists to assist in the selection of men for specific types 
of training, especially in the Air Force. These tests involved more difficult 
operations than those described above, often requiring rapid and com- 
plex sensory-motor coordination, such as the following: the use of both 
hands simultaneously in manipulating two lathe-type handles to follow 
a target that moves in an irregular path; obtaining patterns of lights by 
manipulating stick and rudder in a simulated airplane cockpit; reacting 
to four different relative positions of a red light and a green light by 
pushing one of four switches arranged in a square pattern before the 
subject; moving a wheel, resembling an airplane control, in and out of 
its shaft, in order to hold a horizontal bar in the center of a circular 
aperture; causing a beam of light to follow a given course when the 
horizontal movement is controlled by one lever and the vertical by an- 
other. 

Evaluation. Tests of sensory capacity and those of motor and 
manual dexterity have been moderately useful in selecting persons for 
specific types of training or for particular jobs. The functions and
activities measured by these sensory and motor tests are practically unrelated,
it appears, to the mental functions measured by tests of general ability;
for the many correlational studies made between sensory and motor tests,
on the one hand, and tests of intelligence, on the other, have yielded
coefficients that are low, some being negligible. It has been concluded,
therefore, that these two types of psychological instruments measure
functions that are largely independent of one another.

Fig. 18.4. The Discrimination Reaction Time Test.

This test was designed to measure how quickly individuals make differential
manual responses to visual stimulus patterns differing from one another with
respect to the spatial arrangement of their component parts. The test requires that
the candidate react by pushing one of four toggle switches in response to the
lighting of a red and a green signal lamp. The position of the red lamp with
respect to the green determines which of the four switches should be pushed.

A front view of a single test unit, with designations of lights and switches, is
shown in the figure. The four stimulus lamps, two red and two green (L1, L2,
L3, L4), are arranged in the form of a square on the vertical panel facing the
candidate. The stimulus to which the candidate must react by operating one of
the four toggle switches is the simultaneous lighting of one of the red lights and
one of the green lights. If he operates the correct switch, the white signal lamp
(which lights on every trial) is extinguished immediately, signaling the candidate
that he has made the correct response. The colored lights do not go out until they
have been on for 3 seconds, regardless of how quickly the correct switch has been
pushed.

The four spring-return toggle switches (S1, S2, S3, S4) are so set that the
candidate must push each one in a different direction. The four directions of
movement correspond to the four signal patterns formed by the lighting of the
red and green lamps. Thus, if L1 and L[?] are lighted, the red is "up" with respect
to green, and the upper switch, S1, must be pushed up. If L2 and L4 are lighted,
the red is to the right of the green, so the switch on the right, S2, must be pushed
to the right. The time taken to operate the correct switch on each of a series of
test trials is accumulated on an electric stop-clock and constitutes the candidate's
score.

From Apparatus Tests, Report No. 4, Army Air Forces Aviation Psychology
Program, edited by A. W. Melton. U.S. Government Printing Office, 1947.


Fig. 18.5. The Two-Hand Coordination Test.

The Two-Hand Coordination Test was designed to measure a candidate's
ability to coordinate the movement of both hands. He is required to control the
movements of a target-follower in response to a visually perceived target moving
at varying rates along an irregular pathway.

A single test unit, as seen from the candidate's position, is shown in the figure.
Two handles which he manipulates are seen in the foreground and at the left.
Rotation of the upper handle causes a contact point, which is mounted on the
leaf of a microswitch, to move toward the candidate with counterclockwise
rotation and away from the candidate with clockwise rotation. Rotation of the [...]
handle in a counterclockwise and clockwise direction causes the contact point to
move to the left and right, respectively. Rotation of both handles simultaneously
causes the contact point to move in any desired direction in the plane of
movement of the target. A candidate's task is to manipulate the controls in such a
way as to keep the target-follower on top of a round brass button (the target) as
it moves along an irregular clockwise path. When the contact point is on the
target button, the microswitch is closed and current flows to an electric clock
located on a remote control desk. The time which is accumulated on the clock
during a series of eight 1-minute trials indicates the efficiency of the candidate's
performance.

From Apparatus Tests, Report No. 4, Army Air Forces Aviation Psychology
Program, edited by A. W. Melton. U.S. Government Printing Office, 1947.


Fig. 18.6. Bi-Manual Planned Pursuit Test.

The Bi-Manual Planned Pursuit Test was designed and developed to measure
ability to coordinate the activities of both hands by a systematic shifting of
attention. The test consists of an irregular polished brass pathway which moves
beneath two pointers. The pointers are separated by a distance of 8 inches. The
pointers are adjustable by the candidate by means of two vertical handles, the
candidate being required to keep the pointers (one with each hand) in contact
with the moving pathway. In view of the fact that a limited amount of the
pathway is visible prior to reaching the contact pointers, it was believed that a certain
amount of planning could occur and would in part determine the score on the
test. The test consists of six 1.5-minute trials. Rest periods of unspecified duration,
probably about [...] seconds, occurred between trials. The score is the length of
time during which both pointers are on the pathways.

From Apparatus Tests, Report No. 4, Army Air Forces Aviation Psychology
Program, edited by A. W. Melton. U.S. Government Printing Office, 1947.


Factor analyses of data obtained with tests of motor activity have
produced results that are consistent with what would be expected by one
who is familiar with these devices. The more significant factors appear
to be the following.


Fine coordination: finely controlled motor adjustments, using large muscles.
Arm-hand steadiness: unspeeded, precise coordinated movements of arm
and hand.
Manual dexterity: speeded arm-hand movements with large objects.
Rate control: continuous motor adjustments to changes in speed and
direction of a moving target.
Reaction time: the well-known rate of response to a given stimulus.
Response orientation: selection of the correct response under speeded
conditions; a complex of sensory-motor behavior.


Although these derived factors do not add new aspects of, or insight
into, motor activity, they should contribute to the development of specific
motor tests that could be applicable to various types of occupations. The
predictive validity of new tests based upon this or similar analyses will
have to be established.

Thus far, reliability data have not been as high as they should be, the
coefficients being, for the most part, in the .70s and .80s. These low
indexes result, in part, from practice effects. This fact suggests that the
usual correlational methods of estimating reliability are not altogether
appropriate with these operations. It would be more useful to evaluate
initial scores on motor tests in terms of expectancy tables; that is, to 
answer these questions: "In certain kinds of individuals, how much 
improvement may be expected in these functions after specified amounts 
of training?" and “At what level can cut-off points of initial scores be 
set, below which an individual is a poor risk for training?" Answers to 
these questions will have a bearing upon both reliability and validity of 
the tests. The reason is that validities of motor tests found with criterion 
tasks have been shown to be most satisfactory in jobs requiring only 
routine assembly and machine operations; that is, the types of work in 
which there is little room for improvement with practice. Some of the 
more complex types of work, which require sensory-motor skills, also re- 
quire a higher degree of mechanical comprehension and general mental 
ability; thus, they provide opportunity for improvement with training. 
For these occupations, sensory-motor tests, in themselves, are inadequate 
as predictors. Expectancy tables could indicate the minimum initial 
motor-test scores acceptable for occupations at the several levels in which 
sensory-motor skills are involved. It is probable, too, that the determina- 
tion of such cut-off scores could be used in the guidance of high school 
pupils, as well as in screening applicants for jobs. 
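
The construction of an expectancy table, and the reading of a cut-off score from it, can be sketched as follows. This is a minimal illustration under stated assumptions: the score intervals, the training records, the criterion of "success," and the 60 percent standard are all invented for the example, and the function names are hypothetical.

```python
def expectancy_table(records, lower_bounds):
    """Tabulate proportion successful within each initial-score interval.

    records      -- iterable of (initial_score, succeeded) pairs
    lower_bounds -- ascending lower limits of the score intervals
    Returns a list of (lower_bound, n_cases, proportion_successful).
    """
    counts = [[0, 0] for _ in lower_bounds]  # [n_cases, n_successful]
    for score, succeeded in records:
        # Place the score in the highest interval whose lower limit it reaches.
        idx = None
        for i, lower in enumerate(lower_bounds):
            if score >= lower:
                idx = i
        if idx is None:
            continue  # score falls below the lowest interval
        counts[idx][0] += 1
        counts[idx][1] += 1 if succeeded else 0
    return [
        (lower, n, (s / n if n else None))
        for lower, (n, s) in zip(lower_bounds, counts)
    ]


def cutoff_score(table, min_proportion):
    """Lowest interval limit at and above which every non-empty interval
    meets the required proportion of successes."""
    cutoff = None
    for lower, n, proportion in reversed(table):
        if n and proportion < min_proportion:
            break
        if n:
            cutoff = lower
    return cutoff


# Invented records: (initial motor-test score, succeeded in training).
records = [(12, False), (15, False), (18, True),
           (22, False), (25, True), (27, True),
           (31, True), (33, True), (38, True)]
table = expectancy_table(records, [10, 20, 30])
for row in table:
    print(row)
print("cut-off:", cutoff_score(table, 0.60))
```

In practice each interval would need far more cases than appear here before a cut-off could be trusted, and separate tables would be wanted for occupations at the several levels mentioned above.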


Tests of Mechanical Aptitude 


The capacity designated by the term "mechanical aptitude" is
not a single, unitary function. It is a combination of sensory and motor 
capacities, such as those already described, plus perception of spatial 
relations, the capacity to acquire information about mechanical matters, 
and the capacity to comprehend mechanical relationships. Thus, tests of 
mechanical aptitude are designed to measure capacity and performance 
on a higher level of organization than those of sensory-motor capacity 
and dexterity. 

The Assembly Test of General Mechanical Ability devised by J. L.
Stenquist (1923), the first of its kind and now of little more than
historical interest, was intended to measure a person's ability to put together
the parts of mechanical devices, among them a bicycle bell, a double- 
action hinge, a door lock, and a mousetrap. This test, consisting of three 
series, was constructed for use with individuals covering the age range 
from children in the lower grades through adulthood. 
Stenquist's tests have been revised and extended at the University
of Minnesota and are now known as the Minnesota Mechanical Assembly
Test. This is essentially the same as Stenquist's tests, [...] mechanical
devices having been retained, with new
ones added. Performance on these tests, scored in terms of rate and
accuracy of work, has been found useful in predicting success of junior
high school boys in shop courses. Also, facility in these assembly tests
has been found by some investigators to be one significant indication of
a person's aptitude for a number of occupations such as machinist and
auto mechanic.

The Minnesota Spatial Relations Test (1939) consists of a series of 
four boards, each of which has 58 cutouts of various shapes, many of 
them unusual. The subject’s task is to replace these in their correct holes 
in the board. Evidence indicates that persons engaged in mechanical 
occupations tend, as a group, to earn higher scores than do persons in 
nonmechanical occupations. This fact, it appears, is a principal justifica- 
tion for use of the test as a measure of mechanical aptitude. Some critics 


Fig. 18.7. Triform Pegboard Test. Reproduced from (31).


Fig. 18.8. Minnesota Spatial Relations Tests. The upper and lower parts
represent the formboards that are filled with the pieces represented in
the middle part. Educational Test Bureau, by permission.


"First look at Problem 1. There are two
parts in the upper left-hand corner. Now
look at the five figures labelled A, B, C,
D, E. You are to decide which figure shows
how these parts can fit together. Let us
first look at Figure A. You will notice that
Figure A does not look like the parts in
the upper left-hand corner would look when
fitted together. Neither do Figures B, C,
or D. Figure E does look like the parts in
the upper left-hand corner would look
when fitted together, so E is PRINTED in
the square above 1 at the top of the
page."


Fig. 18.9. Specimen item from the Revised Minnesota Paper Formboard
Test. The Psychological Corporation, by permission.


of the test have concluded that it is adequate as a measure of speed and
accuracy in responding to details of spatial relations and that it yields a
measure of an individual's capacity to work with a variety of details in
[...] objects and concrete materials. On the other hand, it is not
adequate for measuring resourcefulness in solving problems of a
mechanical nature, nor for measuring capacity to manipulate small objects
with precision.

The Revised Minnesota Paper Formboard (1948) is, as its name
indicates, [...] reproduces in printed form the same type of problems
[...] formboards. In each problem, the subject is
shown two or more parts of a geometric figure; when correctly assembled,
[...] figure. It is the subject's task to identify
the correctly assembled figure from among five choices. This test is
designed, it appears, to measure one's capacity to visualize and imaginally
manipulate [...] forms. Reported research has shown this paper
formboard to [...] good correlation with quality of
mechanical [...] and a moderate-to-low correlation with success
in mechanical drawing and descriptive geometry. Criteria included grades
in engineering, shop, and mechanical courses; supervisors'
ratings; and [...] records. As a group, students in engineering and
mechanical vocations obtain higher scores than do other groups of
students. Available evidence, however, demonstrates that this test does not
have high enough predictive value to be used exclusive of other criteria
and information.

Several pencil-and-paper tests are intended to evaluate mechanical 
aptitude by testing for specific mechanical information, specialized vocab- 
ulary, and ability to perceive and deal with practical mechanical prob- 
lems. It will have been noted that the paper formboard type of test and 
its variants are measures of spatial perception, a psychological function 
commonly included in multifactor batteries, discussed in Chapter 17. 

The Tests of Mechanical Comprehension (Bennett et al.) present
mechanical problems in pictorial form. In each instance, accompanying
the picture is a statement of the problem depicted, with two or three
answers from which to choose the correct one. These tests, on three levels
of difficulty, are designed to measure one's understanding of the
operations of physical and mechanical principles in relatively simple
situations. One form is designed for use with high school students,
engineering school applicants, and, in general, with relatively untrained
and inexperienced persons. A second, somewhat more difficult form is
intended for use with engineering school candidates and applicants for
technical courses or employment in technical jobs. The third form was
devised for use with high school girls and women. Since the items should
be appropriate to the level and experience of each group of examinees,
many of those included in the test for women are related to household
activities, involving objects and devices used in a home rather than in
a shop.

Fig. 18.10. "Which gear will make the most turns in a minute?" From the
Bennett Test of Mechanical Comprehension. The Psychological Corporation,
by permission.

Unlike some other tests of mechanical comprehension, this one does
not require specific knowledge, such as matching the parts of a tool;
nor does it require verbal knowledge of tools, processes, or materials. The
items show objects that are almost universal in American life, such as
airplanes, ladders, stairs, wheels, gears, and pulleys. The principle used
is that answers to the problems presented by the test do not depend upon
specific information or training but can be arrived at by analysis of the
materials shown. The extent to which this hypothesis is satisfied varies
among the sixty items. For example, familiarity with principles of physics
or actual experience will be helpful in answering questions involving
pulleys and leverage. Yet, individuals without these advantages, whose
analytical ability is adequate to the task, should be able to answer
correctly.

"Which table is more likely to break?"

Fig. 18.11. Specimen item from the Test of Mechanical Comprehension by
G. K. Bennett and D. F. Fry. The Psychological Corporation, by permission.

From time to time, new tests have appeared in this area, but they have 
not introduced new concepts or new techniques. The Mellenbruch Me- 
chanical Motivation Test, for example, is a picture test requiring recog- 
nition of and information about certain objects commonly encountered 
in our environment (for example, faucet, electric razor, towel rack, pulley 
wheel). The task, actually, is to match pairs of objects that belong
together. The assumption underlying this test is a very simple one: that
individuals with mechanical interest and capacity, more than others, will 
observe the uses, parts, and relationships of these objects. “Mechanical 
motivation” is thus inferred. 

Another relative newcomer is the SRA Mechanical Aptitudes Test. Its 
three subtests are devised to measure mechanical information (names 
and uses of tools), form perception and spatial visualization (similar to 
the Minnesota Paper Formboard), and solution of problems in shop 
arithmetic (including use of tables and diagrams). These three subtests 
were selected on the basis of face validity; for they are regarded by their
authors as significant in, and hence applicable to, a variety of mechanical
jobs.

The Employee Aptitude Survey (Ford et al.) utilizes the same types of
subtests as the tests already described in this and the preceding chapters.
Its tests of visual perception, however, include one of "visual pursuit," 
which is uncommon (see Fig. 18.12). The authors regard this as a sig- 
nificant measure of perceptual ability for such persons as draftsmen, 
design engineers, electronics technicians, ". . . and other personnel whose 
work requires the use of complex schematic diagrams." 


Look at the example below. The problem is to follow each line through with
your eyes from its number at the right to the box where it comes out on the left.
The first three items have been marked to show you how. Line 1 has been drawn
extra heavy so that you can trace it more easily. Trace lines 1, 2, and 3 for yourself,
to make sure that the correct answers have been marked.


Fig. 18.12. Sample of visual pursuit item from Employee Aptitude Survey. By
permission.


Evaluation. It is evident from the previous descriptions that 
mechanical aptitude should be regarded as a complex of several func- 
tions in the measurement of which some tests are limited to only one 
or two aspects, while others are more comprehensive.

On the whole, the tests of mechanical aptitude are statistically
reliable, the correlation coefficients being, for the most part, in the .80s.
As usual, of course, the indexes for some subtests are lower than for the
tests as a whole. The modal interval of their validity coefficients, when
reported, is .40-.50, but evidence of validity, in some instances, is quite
inadequate. If we regard marks in high school shop courses, scores of
occupational and educational groups (mechanic versus nonmechanic),
and low correlations with tests of general intelligence as criteria, then we
can say that some of the available tests in this field have a reasonable
degree of validity for purposes of educational guidance. On the whole,
by comparison with tests of intelligence, those of mechanical aptitude are
inferior in respect to technical level of standardization and predictive
value in actual performance.

The Wrightstone and O'Toole test may be cited for its relatively
favorable statistical evidence. Its reliability is high, about .90.
Intercorrelations of subtest scores range from .30 to .70, with a median of .55.
Correlations of subtest scores with total scores vary from .52 to .77, with a
median of .70. Evidence of validity is offered in two ways: in terms of
face validity and of correlations with instructors' ratings in a training
course for aviation mechanics. In the first instance, the test's authors state
that their device measures the skills prescribed as necessary in eleven
mechanical occupations. In the second instance, the coefficients of
contingency³ vary from .60 to .78, with a median of .67. These coefficients are
among the highest in this area of investigation, and considerably higher
than most others.
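
For readers who wish to see how a coefficient of contingency is obtained
from data arranged in categories, the following is a minimal sketch in
Python. It assumes that Pearson's coefficient, C = sqrt(chi-square /
(N + chi-square)), is the index meant; the text does not state which
formula the test's authors used. The function name and the table of
counts are hypothetical and serve only as illustration.

```python
import math


def contingency_coefficient(table):
    """Pearson's coefficient of contingency for a table of counts.

    `table` is a list of rows; every row and column total must be
    greater than zero, otherwise an expected count of zero would
    cause a division by zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return math.sqrt(chi_square / (n + chi_square))


# Hypothetical counts (not from any study): rows are low, middle, and
# high thirds of test scores; columns are low, middle, and high ratings.
ratings_by_score = [
    [20, 8, 2],
    [8, 14, 8],
    [2, 8, 20],
]
c = contingency_coefficient(ratings_by_score)
print(round(c, 2))
```

It should be kept in mind that the upper limit of this coefficient depends
upon the number of categories and is always less than 1.00, so that values
obtained from tables of different sizes are not strictly comparable.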

Numerous studies have been published in which tests of mechanical 
aptitude have been intercorrelated. The reported coefficients are almost 
uniformly low (below .50). A few of the reported coefficients are mod- 
erate; that is, somewhat above .50. The reasons for these relatively low 
coefficients—unlike those found between the sounder tests of intelligence 
—are to be sought in the following factors. (1) Some of the tests are 
much more comprehensive in scope than others that are relatively re- 
stricted and homogeneous in content; hence, the former measure a greater 
number of functions, some of which may have little communality with 
the latter. (2) Not all of these tests are calibrated for the same levels of 
difficulty; hence they do not have equal differentiating value at a given 
level. (3) Some of the tests are much more dependent upon experience 
and specialized information than are others. (4) Performance levels on 
several tests may reflect different degrees of interest and motivation in
special areas. 

In connection with (1), above, study of the content of tests of me- 
chanical aptitude shows that they sample, more or less, the following 
functions: visual-motor integration, spatial visualization, perceptual
speed, manual dexterity, and visual insights (analysis). In addition to
these, some of the tests measure specialized information, knowledge of 
techniques, arithmetical problem-solving ability, and technical vocab- 
ulary. Some of the functions are measured by means of apparatus tests 
(Figs. 18.4, 18.5, 18.6), others by means of performance-type materials 
(formboards, etc.), and still others by means of pencil-and-paper tests. It 
is not surprising, therefore, that intercorrelations between these tests are 
low, even though they are placed within the same category. 


³ This coefficient is derived from data arranged in several categories rather than in
intervals of scores of a variable.

On the whole, tests of mechanical aptitude show moderate correlations
with actual job performance. This fact does not necessarily signify that 
the tests themselves are defective, for many nonmechanical factors affect 
both the job ratings and the actual performance on the job. These factors 
include subjective judgments of the raters, as well as the worker's health, 
motivation, and personality traits that may facilitate or impede perform- 
ance. 

Although the foregoing considerations lower the correlation coeffi- 
cients, it is highly improbable that they are solely accountable for the 
findings. It is essential, therefore, that a given test of mechanical apti- 
tude be studied for each type of occupation for which its use is being con- 
sidered, in order to establish its validity, not only in terms of correlation 
coefficients but, more significantly, in terms of critical cut-off scores and 
of expectancy tables for each of several levels of test performance. As 
in the case of some multifactor batteries, too much emphasis is placed on 
correlation coefficients, while the overlapping of scores among criterion 
groups and the probabilities that could be shown in expectancy tables
are neglected.
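
The construction of an expectancy table of the kind recommended above
can be sketched briefly in Python. The sketch assumes that each person
has a test score and a simple success-or-failure criterion; the function
name, the score intervals, and the data are hypothetical and are offered
only to show the form of the table, not any actual findings.

```python
from bisect import bisect_right


def expectancy_table(scores, succeeded, lower_bounds):
    """Percent of successful persons within each score interval.

    `lower_bounds` gives the lower limit of each interval; the highest
    interval is open-ended. Scores below the lowest limit are not
    tabulated. Returns (lower_limit, number_of_cases, percent_successful)
    for each interval; the percent is None when an interval is empty.
    """
    bounds = sorted(lower_bounds)
    counts = [[0, 0] for _ in bounds]  # [successes, cases] per interval
    for score, ok in zip(scores, succeeded):
        k = bisect_right(bounds, score) - 1
        if k < 0:
            continue  # score falls below the lowest interval
        counts[k][1] += 1
        if ok:
            counts[k][0] += 1
    return [
        (bound, cases, 100.0 * wins / cases if cases else None)
        for bound, (wins, cases) in zip(bounds, counts)
    ]


# Hypothetical scores and job outcomes for ten workers.
scores = [12, 18, 25, 31, 36, 44, 47, 52, 58, 63]
succeeded = [False, False, False, True, False, True, True, True, True, True]

table = expectancy_table(scores, succeeded, lower_bounds=[0, 20, 40, 60])
for lower, cases, percent in table:
    print(lower, cases, percent)
```

A table of this form shows directly the probability of success at each
level of test performance, and a critical cut-off score can be chosen by
inspecting where that probability becomes acceptably high.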

Some of the tests of mechanical aptitude are useful when vocational
or educational guidance is the problem at hand, for they are valuable
supplements to other kinds of information. For example, the several
tests of mechanical comprehension are quite useful for selection in
situations where understanding of machines is necessary. And it has been
found, in fact, that this type of test material is one of the most
satisfactory for selective and predictive purposes in this area.⁴ In any
case, in guidance work it would be desirable to administer more than one
test of mechanical aptitude, since intercorrelations of the several
instruments are not high enough to warrant their use interchangeably. The
particular combination of tests used in any given situation should depend
upon the nature of the problem presented by the individual concerned and
upon the kinds of jobs under consideration.


Tests of Clerical Aptitude


Description. Clerical aptitude, like mechanical, is not a unitary
function. The tests consist of several kinds of items, some of which
correlate quite highly with scores on tests of general intelligence but
differ from the latter in that they contain selected materials that are
significant in clerical occupations.


⁴ Since tests of mechanical comprehension involve visual analysis of, and insights into,
nonverbal materials (facilitated by verbalization), they appear, also, to be testing general
mental ability through the use of mechanical and scientific materials.
It will be recalled that two of the multifactor batteries (the DAT and
the GATB, described in Chapter 17) include subtests intended specifically
to measure clerical aptitude. In this chapter, several instruments,
restricted to this aptitude, will be presented. All the clerical tests have
much in common, both as to functions being measured and as to actual
content. These facts are evident in Table 18.1. The principal differences
are the scope and the detailed information provided by each test. The
Detroit and the General Clerical are the most comprehensive; the
Minnesota is the most restricted.

TABLE 18.1

SUBTESTS IN SIX TESTS OF CLERICAL APTITUDE

Test                Subtests

Detroit             handwriting: rate and quality
                    checking: rate and accuracy
                    simple arithmetic
                    motor speed and accuracy
                    knowledge of simple commercial terms
                    disarranged pictures
                    classification: rate and accuracy
                    alphabetical filing

General Clerical    matching: detecting errors in names and numbers
                    alphabetizing and filing
                    arithmetic: simple calculations
                    arithmetic: locating errors in addition
                    arithmetical problems
                    spelling
                    reading comprehension
                    word meaning
                    language usage: grammar

Minnesota           number comparison
                    name comparison

Purdue              spelling
                    computation
                    checking: speed
                    word meaning
                    copying: accuracy
                    reasoning

Short Employment    numerical operations
                    word meaning
                    classification and filing

Turse               verbal skills
                    number skills
                    written directions
                    checking: speed
                    classification and sorting
                    alphabetizing

The more comprehensive instruments attempt to measure a variety
of specific operations, whereas the Minnesota attempts to measure only
perceptual speed and accuracy, which is only one aspect of clerical work.
Restricting a clerical test thus rests upon the finding that these functions
are basic to clerical work. But since the designation "clerical work" covers
such a diversity of jobs, both in range and level of aptitude, it is desirable
to use tests that include measures of other essential mental operations
(and manual dexterity, if necessary) as well. Thus, as in the case of
mechanical-aptitude tests, those of clerical aptitude should be validated
against specific types of clerical occupations; and, indeed, it will be
desirable, at times, to validate an instrument for use in individual
companies.

 1. 937-937 ( )
 2. obh-obh ( )
 3. Curtis & Co. - Curtis&Co. ( )
 ...
57. 6819002341-6839002341 ( )
58. vrxoaediqf-vrxoaediqf ( )
59. H. W. Hieronymous - H. W. Hiernymous ( )

Fig. 18.13. Rate and accuracy in perceiving similarities and differences.
From the Detroit Clerical Aptitudes Examination. Public School Publishing
Company, by permission.

Analysis of the subtests generally included in these devices yielded the 
following three factors: comprehension of verbal and numerical rela- 
tions, perceptual analysis, and rate of making simple visual discrimina- 
tions. In view of the content of current clerical aptitude tests (Table 


18.1), the emergence of these factors should occasion no surprise. 


Evaluation. On the whole, results obtained with clerical-aptitude
tests, if critically interpreted, will contribute to a better
understanding of a pupil's capacities and to his guidance in the selection of
a high school course, although the tests correlate only moderately with
marks in commercial courses (generally from about .30 to .50). But
moderate correlations between test results and school marks are the rule
rather than the exception, for the size of the coefficients is affected by
several factors external to the tests: pupils' lack of interest or incentive
in their courses; interference of nonintellectual factors, such as poor
health, economic pressure, emotional forces, and extracurricular
activities; and the variability of teachers' marks.

Although their reliability coefficients are generally within the
satisfactory range, tests of clerical aptitude do not provide unequivocal
evidence of their general value for the prediction of competence and quality
of performance on the job itself. Validity correlations generally fall
between .20 and .45. However, as so often happens even when validity
correlations are low, the tests are useful in identifying those persons at
the higher levels and those at the lower. Here again, cut-off scores and
expectancy tables can be more revealing than correlation coefficients.

Authors of these tests provide, among other data, norms for a variety
of groups: for example, high-school seniors in commercial courses, office
workers, nonoffice workers, different classes of clerks, employed and
unemployed office workers. Norms for each of these several groups indicate
the expected average and range of scores in each instance. The range of
scores within each of the groups, however, and the extent of overlapping
of scores between any two groups are so considerable, and the validity
correlations low enough, that in any individual case a detailed
analysis of test results must be made, not in isolation but in conjunction
with other information concerning the individual under study.

Not all available tests of clerical aptitude are of equal technical merit.
The norms of some are inadequate as to numbers in or diversity of the
standardization population, or both. Some present quite inadequate
validation data. On the whole, tests of clerical aptitude have not reached
the technical standards of tests of general intelligence. Prospective users,
therefore, should carefully scrutinize the manual of any clerical aptitude
test they contemplate using.

In view of the rapidly changing nature of clerical work in large
organizations because of the introduction of machine calculators and
other forms of automation, it will be necessary to reanalyze many types
of clerical jobs and to revise existing tests accordingly, or probably to
create new and different ones.⁵


⁵ For a comprehensive list of tests of mechanical and clerical aptitudes see O. K. Buros
(ed.), The ... Mental Measurements Yearbook, Highland Park, N.J.: The Gryphon

APTITUDE TESTS: FINE ARTS
AND PROFESSIONS


Aptitude in Music 


Since scores on tests of general ability and grades in the usual
school subjects are not well correlated with the specific psychological
requirements in music or in the graphic arts, it is necessary to have
special measures for them. In view of the current emphasis upon
identifying elementary and secondary school pupils of superior intellectual
promise for higher education in the sciences, mathematics, technologies,
and in the humanities, it would be desirable to have instruments to
assist in identifying pupils of superior promise in the fine arts. A
relatively small number of tests are available for assessing aptitudes in music
and the graphic arts, but they are not so well developed as those of
general intelligence or of educational achievement. Nor does there seem to
be wide interest, among psychologists, in developing them. Perhaps the
reasons for this state of affairs are the difficulties inherent in defining
these aptitudes, and the highly subjective judgments in evaluating them
and their products. The tests that are available, however, have made a
contribution and can be of assistance.

The earliest tests in the field of music are the Seashore Measures of
Musical Talent, intended for use from grade 4 through the college level.
The six aspects of hearing, measured by means of phonograph
recordings, are as follows:

Pitch discrimination: judging which of two tones is higher, graded from
    readily discernible differences to very fine ones.

Intensity or loudness discrimination: judging loudness of pairs of notes,
    with gradation differences.

Time discrimination: judging whether the duration of a note is longer or
    shorter than the duration of the same note sounded a second time.

Discrimination of timbre: judging whether two tones are the same or
    different in quality.

Judgment of rhythm: paired rhythmic patterns to be discerned as being the
    same or different.

Tonal memory: paired tonal patterns are played; ability to perceive the
    difference between the members of each pair is tested.

Total scores for the six parts are not used to represent an individual's
rating; instead, a profile representing the scores on each of the six
auditory aspects is prepared.

Seashore approached the development of his measures from a point
of view different from that of authors of most other types of aptitude
tests. Instead of making a "job analysis," in an attempt to discriminate
levels of musical aptitude on an empirical basis, he made a theoretical
analysis of musical talent into its sensory components. Thus, he
employed the principles of content and construct validity. Some of the
components, he held, can be measured objectively, whereas others
cannot. The six capacities listed are among the measurable ones; but as
Seashore indicated, they do not provide measures of all components of
musical aptitude. The tests measure auditory aspects regarded by him
and others as fundamental to the development of musical proficiency.

For the purpose of validation, results of the Seashore measures have
been compared with teachers' ratings of musical ability, with musical
achievement, and with quality of work in schools of music. The obtained
correlations have indicated that these tests of auditory perception are
not sufficiently valid in predicting various levels of musical talent.
Seashore himself, however, has objected to attempts at an over-all validation
of his measures. He maintains that each of the six tests should be
separately validated against different kinds of specialized musical activity.
For example, the test of pitch discrimination should be validated
especially for players of string instruments.

This much at least may be said: the Seashore tests identify those
persons whose auditory capacity is so deficient that they could not
successfully engage in the formal study or performance of music.

The Wing Standardized Tests of Musical Intelligence, for the ages of
8 years onward, is intended, among other uses, to overcome a frequent
objection to the Seashore tests: that the latter are "atomistic" and thus
do not represent an actual musical experience. The Wing tests measure
seven aspects of musical perceptiveness, yielding a score for each. Test
performance is represented, however, by a single total score, since
"musical intelligence" is regarded as a unitary, though complex, aptitude.

The seven aspects are chord analysis, pitch change, memory, rhythmic 
accent, harmony, intensity, phrasing. These are held to be more closely 
associated with the actual teaching of music than are the Seashore tests, 
since Wing's are standardized upon individually constructed types of 
tests employed by teachers and examiners in music. The Wing tests, 
therefore, are considered by many teachers of music to be more relevant 
to the selection and training of individuals in that art. 

The Drake Musical Aptitude Tests measure two aspects: musical mem- 
ory and rhythm. The first of these tests ability to remember two-bar 
melodies, and the second to maintain mentally and silently a metronomic 
beat. These, also, are intended for use with individuals of 8 years and 
older. 

The Aliferis Music Achievement Test differs from those already de- 
scribed in that it has been devised for use with entering college-freshmen 
students of music. The functions tested are not unusual; they are
"auditory-visual discrimination of melodic, harmonic, and rhythmic elements
and idioms."

Evaluation. The foregoing tests of aptitude in music are rea- 
sonably reliable measures of the functions they include, when total scores 
are considered. The part-score reliabilities (split-half) of the Seashore, 
however, vary from a low coefficient of approximately .60 to a satisfactory
high of about .90. These data signify, of course, that the part scores
often are not reliable enough to discriminate, except grossly, among the 
several sensory functions within an individual. The part scores, however, 
may still be valuable for identifying those having a high level of auditory 
perception and discrimination and those having a low level. 

The Wing and the Aliferis tests, on the other hand, using total scores, 
report quite satisfactory reliability coefficients. For the first of these, the 
indexes (split-half and test-retest) vary from .85 to .95; while for the 
second, the reported coefficient is .88. The reliabilities of the Drake tests,
each part taken separately, are in the .80s and .90s.
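
Since split-half coefficients are cited above, a minimal sketch in Python
of how such a coefficient may be computed is given here. It assumes an
odd-even division of the items and the Spearman-Brown correction to full
test length; whether the manuals of the tests under discussion used
exactly this procedure is not stated in the text. The function names and
the item scores are hypothetical.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists.

    Both lists must show some variation; otherwise the denominator is zero.
    """
    mx, my = mean(x), mean(y)
    covariation = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariation / (spread_x * spread_y)


def spearman_brown(r_half):
    """Estimate full-length reliability from a half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, corrected by Spearman-Brown.

    `item_scores` holds one list of item scores (1 = right, 0 = wrong)
    per person, all of the same length.
    """
    odd_half = [sum(person[0::2]) for person in item_scores]
    even_half = [sum(person[1::2]) for person in item_scores]
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical item scores for five persons on a six-item part test.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(item_scores), 2))
```

The correction is needed because the correlation between halves describes
a test only half as long as the one actually given; a half-test
correlation of .60, for example, corresponds to an estimated full-length
reliability of .75.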

It is difficult to validate a test of musical aptitude against a criterion 
of achievement, because grades in music are fragmentary or nonexistent, 
grading is exceedingly subjective, training opportunities are markedly 
unequal, and interest and motivation are highly variable from person 
to person. Thus, even after the years the Seashore tests have been in use,
many psychologists believe their predictive value has not been sufficiently
demonstrated. Seashore himself, however, maintained that content and
construct validity are most appropriate and adequate for tests such as his.




The Wing tests present stronger evidence of validity. With the first 
edition, in a study of above-average, average, and below-average students 
of musical instruments (ages 14-16; N = 333 boys), it was found that 
40 percent of the lowest group had discontinued their musical studies, 
as compared with 27 percent of the average and 2 percent of the highest.
A finer classification would have defined the failures more precisely. 
Teachers’ ratings correlated .60 or above with scores on these tests. These 
validity data are very encouraging in this difficult area of testing. 

The Aliferis and the Drake tests also stand up reasonably well in
regard to validity; the total scores of the former correlate between .50 and
.60 with grades in music courses. The manual of the Drake test reports
a median coefficient of .59 when scores were correlated with "talent"
as rated for performance and learning music.

On the whole, it appears that the sounder tests of musical aptitude 
are helpful in identifying individuals with poor prospects and those with 
high prospects for profiting from instruction in music, so far as these 
essential psychological functions are involved. In spite of differences in 
hypotheses and scoring, it appears from factorial analyses that the several 
testing devices are measuring the same or similar functions to a significant 
degree. It appears that there is a general factor operating in the several 
tests, accounting for 30 to 40 percent of the variance in scores. This 
general factor is identified as perception of a complex of the sensory 
aspects of musical aptitude. Since this appears to be the case, it would 
be desirable to measure these aspects by means of tests that demand 
complex perceptions, rather than by testing separate and isolated audi- 
tory perceptions. The former method of testing is indicated, especially 
because teachers of music and others in that field maintain that the 
separate sensory functions do not operate in learning and performing
as they do in controlled and isolated laboratory situations. The atomistic
test, however, can serve a useful purpose in the case of an individual who
is found to be deficient in one or more of the complex functions, and
whose deficiency is to be analyzed in more detail in order to learn what
is lacking.


Aptitude in the Graphic Arts 


Testing aptitude in the graphic arts encounters much the same
kinds of difficulties as testing in music, in regard to content and validity
criteria. Several of the current tests will be described and evaluated for
illustrative purposes.


¹ It is unfortunate that among contemporary psychologists, research on psychological
problems in music (other problems, as well as those in testing) gets little attention or
interest. In fact, the entire field of esthetics is rather neglected by psychologists.
The Graves Design Judgment Test is intended to measure the degree
to which an individual ". . . perceives and responds to the basic
principles of aesthetic order: unity, dominance, variety, balance, continuity,
symmetry, proportion, and rhythm" (construct validity). This is done by
presenting ninety pairs, or triads, of abstract designs in each of which
one is organized in accordance with the eight principles listed above,
whereas the other design or designs violate one or more of the principles.
An individual's selection of preferred designs ". . . would prove to be
a criterion of his esthetic perception and judgment" (Fig. 19.1).
Representational art was not included because it is believed judgments of such
pictures could be influenced by individual experiences, feelings,
prejudices, and conceptions of art. Relatively pure esthetic choice is
therefore emphasized.

Fig. 19.1. From the Graves Design Judgment Test. The preferred
design is to be selected. The Psychological Corporation, by permission.

The Graves test is devised to evaluate a unified, complex perceptual
function, rather than to make separate analyses of elements that enter
into esthetic judgment. In this respect, it is consistent with the Wing
test in music, and with the principle of a general factor in intelligence
testing.

The Meier Art Judgment Test is another device intended to measure
esthetic judgment in a "global" manner. Unlike the Graves, however,
it consists of one hundred pairs of representational pictures in black and
white. One member of each pair is a reproduction of a recognized
masterpiece, while the second member has been altered from the original in an
important aspect so as to make it inferior to the original. Testees are
informed regarding which aspect has been altered (for example, shape,
angles), but they are not told which is the original. Each individual is
required to indicate his preference in each pair (Fig. 19.2). Meier
maintains that esthetic judgment is the "key capacity," the most trustworthy
and significant index to talent in art and to success in a career in art.
The soundness of this view is seriously questioned, although few would
deny that esthetic judgment is one of the principal capacities required
for a career in art. If it were true that esthetic perception and judgment
constitute the "key capacity," rather than the capacities to execute and
create, we should then expect art critics and other writers on art to be
talented in actual production. Such, of course, is not the case.

Fig. 19.2. One of these pictures represents an artistic work of
established merit. The other is an adaptation of that work, and
esthetically inferior. In this pair, the subject is required to select the
original and esthetically superior work on the basis of the shapes of
the bowls. From the Meier Art Judgment Test. Bureau of Educational
Research and Service, State University of Iowa, by permission.

Two tests that require and evaluate samples of production in the
graphic arts are the Knauber Art Ability Test and the Horn Art Aptitude
Inventory. The first of these, for use in grades 7 to 16, sets the following
tasks: drawing a design from memory; drawing, from memory, figures
within space limitations; drawing a stereotyped character, such as Santa
Claus; arranging a specified composition within a given space; creating
and completing designs from supplied elements; spotting errors in
compositions, such as incorrect perspective, misplaced details, incorrectly
proportioned details, incongruous or inconsistent elements; production
of compositions intended to show creative imagination, ingenuity, ability
to represent a concept symbolically, or to plan and execute a universal
idea.
Measurements of these complex capacities represent a large order
for a single test. Performance on some parts, it appears, is dependent
upon mastery of stereotypes and traditional problems and upon material
presented in art instruction. This being the case, the Knauber test would
be useful primarily in evaluating school progress in art, quality of
observation, and, to some extent, creative imagination. That is, the test
would be in large part a measure of learning rather than a test
whose primary usefulness is with individuals who have not had the
benefits of some formal instruction.
The Horn inventory was devised for use with applicants for
admission to schools of art.* It is a test primarily of creativeness, using a
flexible and simple prescription, or guide, for each drawing. First,
however, in Part I of the test, the examinee is required to sketch ten
familiar objects (circle, house, book, fork) within short time limits and
on a quite small scale. He is required, also, to create simple
designs, starting with a given set of triangles, rectangles, etc. A much
higher and more complex level of creativeness is required in Part II,
“Imagery.” The examinee is given a set of twelve rectangular cards; each
one has several lines that must serve as the basis or beginnings of a
picture to be drawn (Fig. 19.4). Drawings are judged essentially on the
basis of creative imagination and technical quality (for example,
clarity of thought, shading, quality of line). And in order to minimize
subjectivity of judgment in scoring, examples are provided of drawings
rated as excellent, average, or poor.

* There is no reason, however, why this test and others cannot be used with
other individuals whose aptitude is to be assessed.

PROBLEM 3
This is a test of accuracy and observation.

Score 10
These drawings are good in proportion and quality of line. Details are
accurately observed.

Score 6
The proportions in these drawings are not as good, and the lines are weaker.

Score 3
These drawings are poor in proportion and in execution. They indicate a lack
of careful observation.

Fig. 19.3. From the Knauber Art Ability Test. The subject is shown
a design, in the examination booklet, which he is to copy. This item
is a test of accuracy and quality of reproduction. By permission.

Evaluation. Once again, even in this highly subjective area
of psychological testing, we find reliability coefficients varying from
somewhat low to satisfactory and high, as shown in Table 19.1. In most of
these reliability studies, the groups were relatively homogeneous. For
example, in the Graves studies, for third-year students of “illustration”
the coefficient was .81; for first-year engineering students it was .93.


3 Other tests are listed in the references at the end of this chapter. 
Fig. 19.4. From the Horn Art Aptitude Inventory. The person being
tested is given card A. With the given lines as a beginning, he must make
a completed drawing. B and C are completed samples. By permission.


TABLE 19.1

RELIABILITIES OF FOUR ART APTITUDE TESTS

Test       Correlations   Method
Graves     .81-.93        Split-half
Horn       .83-.86        Ratings of the same papers by two teachers of art
           .76            Alternate forms rated by two teachers of art
Knauber    .95            Split-half
Meier      .70-.84        Split-half
To evaluate validity, the authors of these devices followed familiar
practices. Scores were expected to differentiate significantly among the
several age and grade groups; art students and those who had had art
education were expected to get significantly higher scores than others
(for example, art faculty members vs. nonart faculty; art students vs.
nonart students). The usual correlational validity studies were made, using
small groups, and the results in Table 19.2 are representative.


TABLE 19.2

VALIDITY COEFFICIENTS OF TWO ART APTITUDE TESTS

Test     Coefficient   Criterion
Horn     .53           Mean rating in a 3-year art course
         .66           Grades in a high school senior art course
Meier    .40-.69       Grades in art courses and ratings on creativeness


Authors of tests in this area (for example, Graves and Knauber) used
the principle of content validity in developing their instruments. Graves,
for example, cites these three criteria in deciding whether or not to
retain an item: (1) agreement among teachers of art as to the preferred
design in a pair or triad; (2) greater preference of a design by art
students than by nonart students; (3) greater preference for a design by
high scorers on the whole test than by low scorers. (Percentage of
agreement among teachers and degrees of differences in the latter two criteria
are not stated in the manual.)
It is evident that the authors of tests of aptitudes in the graphic arts 
are not in complete agreement regarding the capacities to be measured, 
although there is some common ground. N. C. Meier, after years of study 
and experiment, identified six interrelated traits that characterize those 
having aptitude in the graphic arts, though not all are found in everyone 
with such aptitude. Meier's conclusions are based upon studies of high 

school and college students (39). 
1. Manual skill.

2. Energy and perseveration. (This trait is much the same as the one that
Francis Galton called “zeal,” in characterizing the work of gifted men and
women.)

3. Esthetic intelligence. This refers to the application of one’s general
intelligence to the field of art.

4. Perceptual facility. “The relative ease and effectiveness with which the
individual responds to and assimilates experience which has potential
significance for present and future development in a work of art."

5. Creative imagination. “The ability to visualize vivid sense impressions
effectively in the creation (organization) of a work having some degree of
aesthetic character."

6. Esthetic judgment. "Ability to recognize aesthetic quality in any
relationship of elements within an organization."


The tests are of two general types: those that evaluate esthetic judg- 
ment and those that require the production of drawings. Since aptitude 
in esthetic judgment is not equivalent to aptitude in the production of art, 
a combination of both types should have greater predictive and selective 
value than either one alone for the purposes of education and guidance. 

It is not surprising that tests in this category differ in content, for
experts in the highly subjective field of art often disagree regarding the
basic criteria or principles to be employed in the selection of superior
designs or pictures; that is, they disagree on esthetic judgment. It is more
than ordinarily difficult, therefore, to establish universal criteria for the
selection and scoring of test items and for the validation of test scores,
once the items have been selected.

Closely related to this problem is the inevitable result that different 
teachers and critics of art, employing varying criteria in their evaluations, 
will rate art productions differently. Consequently, there are relatively 
few data on the predictive efficiency (validity) of tests in the fine arts 
with regard to level and quality of performance in educational and 
vocational activities. Available tests are useful, however, when the criteria 
are specified, in identifying individuals of unusual capacity and in eval- 
uating esthetic judgments in general. 

As already stated, these tests distinguish quite consistently between art 
students and nonart students, as groups; the differences between their 
mean scores are significant, and the respective standard deviations of 
their scores indicate a significant difference in the location on the scale, 
or the limits, of their score distributions. For nonart students, the tests 
can be useful in evaluating capacities that are essential in esthetic appre- 
ciation, as a basis for nonvocational education in the fine arts. 

Finally, the available tests of esthetic judgment assume, apparently,
that this capacity is generalized and transferable from the evaluations of
pictures and of relatively pure designs, to architecture, furniture, clothing,
sculpture, and industrial products. This assumption may be warranted,
and actual demonstration of its validity would be of significance in
school and college courses that offer instruction designed to cultivate
esthetic judgment. Definitive research has yet to be done on this
psychological and educational problem.4


4 The graphic arts, like music, have been neglected by contemporary psychologists.
Aptitude in Medicine


Background. The first of the aptitude tests for medicine ap- 
peared in numerous revised editions over a period of about twenty 
years, under the direction of F. A. Moss, sponsored by the Committee on 
Aptitude Tests for Medical Students, Association of American Medical 
Colleges. It is interesting to note the functions tested by these, and later 
to compare them with the content of the instrument now current. 

The earlier devices included the following subtests: visual memory,
memory for subject-matter content, scientific vocabulary, understanding
of printed material, scientific definitions, and logical reasoning. These
subtests were based upon an analysis of aptitudes believed, by experts, to
be necessary for the study of medicine; namely, content validity. The
subtests were justified on the following basis. First, it is essential to have
sufficient mental alertness to learn quickly and to organize the material
learned, so that it can be retained and utilized in later work. A sampling
of medical materials was used in the test to examine this capacity. Second,
past scholastic performance may be expected to indicate future learning.
Inasmuch as all premedical students have had elementary courses in
chemistry, physics, biology, and English, sections of the test are devoted
to questions in these subjects to determine the extent of the candidates'
learning in them. Third, the capacity to make correct interpretations
and deductions from given data was considered essential. Hence, the
test included a passage of difficult reading, using medical subject matter.
The testee is required to make certain interpretations and deductions
based upon the passage, to which he may refer at any time. Fourth, since
medical students and practicing physicians are expected to draw
conclusions and make diagnoses from given facts, a subtest was constructed to
evaluate “logical reasoning.” This consists of a set of premises and
conclusions drawn from them, the student's task being to determine whether
or not the conclusions are warranted.

In 1946, the committee, which Moss directed, was [...].
Subsequently and for a number of years, the development of tests of medical
aptitude was taken over by the Educational Testing Service. At present,
these tests are being administered and studied by the staff of The
Psychological Corporation. Although the newer materials differ in specific
content from [...] will show that the psychological and [...].

It will be observed that the mental capacities measured by means of the
foregoing types of test materials are not peculiar to the study of
medicine. They are, indeed, capacities required in all fields of
higher education and in all professions. These devices, therefore, though
called aptitude tests, are actually tests of general mental ability,
utilizing, in part, selected materials included in or closely associated with
the materials studied in medical schools. In other words, the forms of
mental activity being tested are the same as in any other
field, but the content is in part specialized. It is in this sense that these
are "aptitude tests," as the term has been defined.
Medical College Admission Test. The purpose of this
test was stated to be the provision of highly dependable measures of the
advanced student's general ability and of his achievement in a special
field of study. The tests are predicated upon the principle that a significant
aspect of potentiality for a specialized field of study at the graduate or
preprofessional level may be measured by testing the candidate's general
scholastic ability and his achievement in a special field that is prerequisite
to advanced study in the same or a closely related field.
In addition, a test of "understanding modern society" (now called
"general information") has been included in the battery. This part
includes current history, economics, political science, and sociology, the
purpose being to evaluate alertness to and interest in social issues, rather
than materials retained from college courses in these fields of study.
The total test consists of four parts, described in the following terms:


Verbal ability: a test of vocabulary strength and ability to perceive verbal 
relationships; it is indicative of ability to handle postgraduate study. 

Quantitative ability: a measure of ability to reason through and under- 
stand quantitative concepts and relationships; with verbal ability, it is in- 
dicative of general academic aptitude. 

Understanding modern society (general information): a test broadly
covering the subjects of history, economics, government, and sociology as they
pertain to the contemporary scene; it measures social awareness; it does not
presuppose specific college courses in these subjects.

Science: a test covering a wide sampling of concepts and problems taken 
from college courses in biology, chemistry, and physics. 


The following are sample items.5


Verbal ability

Choose the word which is most nearly opposite in meaning to the word
in capital letters:

SPORADIC: (A) immediate (B) regular (C) affiliated
(D) conflict (E) replete

Each sentence below has one or more blank spaces indicating a word has
been omitted. Choose the one word or set of words which, when inserted
in the sentence, best fits with the meaning of the sentence as a whole.

The manufacturing of small machinery is profitable for the nation
lacking mineral resources, inasmuch as ______ can be exploited
in place of ______.

(A) quantity . . . quality (B) power . . . efficiency (C) skills . . .
quality (D) alloys . . . ores (E) skills . . . materials

Each question below consists of two words which have a certain
relationship to each other followed by five lettered pairs of related words. Select
the pair of words which are related to each other in the same way as the
pair of words in capital letters.

ASHES : FIRE :: (A) extinction : life (B) repentance : sin
(C) disaster : jealousy (D) depression : prosperity
(E) relics : civilization


5 From the 1961 announcement of the test, issued by The Psychological Corporation.

Quantitative ability 

Solve each problem and then indicate the one correct answer. 

If a motor launch goes 3r miles on 2t gallons of fuel, how many miles will
it go on 6rt gallons?

In questions 14 and 15 it is not necessary to solve the problems, but simply
to show what information is needed to solve each of them. Read carefully
each question and the two facts which follow it, then choose

A if fact (1) alone is sufficient but fact (2) alone is not sufficient to
answer the question,

B if fact (2) alone is sufficient but fact (1) alone is not sufficient to
answer the question,

C if facts (1) and (2) together are sufficient to answer the question
but neither fact alone is sufficient,

D if either fact (1) alone or fact (2) alone is sufficient to answer the
question,

E if more information is needed to answer the question.

Be sure to decide on the minimum facts which are necessary to answer each
question.

14. Can line segments XZ, XY, and YZ form the sides of a triangle
XYZ?
(1) XZ = YZ
(2) XZ = 34, XY
15. Is X greater than Y?
(1) 3X = 2k, 4Y = gk, k is positive
(2) X + Y = 5


Understanding modern society
Each of the following questions or incomplete statements is followed by
five choices. Select the choice that best answers the question or completes
the statement.
16. Of the following, which is the best measure of public interest
in a particular election?
(A) The number of offices to be filled

(B) The size of the popular vote

(C) The amount of campaigning preceding the election

(D) The importance of the issues at stake

(E) The amount of money spent by the opposing parties for
campaign purposes

17. In the majority of cases, the residential distribution of families 

in a city is governed chiefly by the 

(A) length of residence in the city 

(B) ability to pay for housing and transportation 

(C) scale of values concerning the most desirable quarter of the 
city 

(D) topographic features of the city 

(E) location of hospitals and schools 


Science 
Each of the following incomplete statements is followed by five suggested
completions. Select the one completion which is best in each case.
The nucleus of an atom consists of

(A) electrons and protons
(B) protons and neutrons
(C) electrons, protons, and neutrons
(D) electrons and neutrons
(E) protons only


Each of the following questions consists of an incomplete analogy with
five suggested completions. Select the one word or phrase which best
completes the analogy and indicate your selection in the appropriate space on
the answer sheet.

26. ohm : resistance :: watt :
(A) electricity (B) work (C) power (D) current
(E) potential

27. atom : molecule :: element :
(A) electron (B) mixture (C) isomer (D) isotope
(E) compound

28. yolk : egg :: ______ : bean seed
(A) hypocotyl (B) epicotyl (C) cotyledon (D) testa
(E) endosperm


The following paired statements describe two entities which are to be
compared in a quantitative sense. Choose
A if (A) is greater than (B)
B if (B) is greater than (A)
C if the two are equal or very nearly equal
29. (A) The total resistance of two given resistances in series
(B) The total resistance of the same two resistances in parallel
30. (A) The volume occupied by one gram-molecular weight of
helium at standard conditions
(B) The volume occupied by one gram-molecular weight of
oxygen at standard conditions


The passage below is followed by a series of statements. Read the passage 
and then classify each of the statements under one of the following cate- 
gories: 

(A) The statement is warranted by information given in the pass- 
age 

(B) The statement is true but not warranted by the passage 

(C) The statement is contradicted by the passage 

(D) The statement is contradicted by established evidence but not 
by the passage. 

From a nutritional standpoint, proteins are classified as
complete and incomplete. A complete protein is one which
contains all the eight amino acids essential for growth and
maintenance; an incomplete protein is deficient in one or more of
the essential amino acids. A person on a diet consisting mainly
of incomplete proteins soon develops a negative nitrogen
balance, loses weight, and experiences a lowered resistance, with
impaired physiological processes. A vegetarian diet must be
supplemented with milk, dairy products, or eggs to maintain a
positive nitrogen balance in the human body. Most animal
proteins, except gelatin, contain all the essential amino acids
for normal growth and maintenance, but certain plant
proteins are deficient in one or more of the essential amino acids.

32. A gram of protein when oxidized in the body tissues yields
about 4 calories of heat energy.
33. Gelatin is deficient in one or more of the essential amino acids.
34. Incomplete amino acids contain no nitrogen.
35. A diet made up entirely of fruits and vegetables will supply all
the amino acids necessary for growth.

Evaluation. The purpose of these tests is to provide an
improved basis for predicting quality of performance in medical studies,
not in medical practice. In some instances, the test scores have been
better predictors than have premedical course grades; in other instances
the reverse has been true. But in all cases, the best criterion is a
combination of the two. For example, in one validating study, the following
coefficients were found for a group of students who were not selected on
the basis of test results as one criterion: premedical course grades with
medical school averages, .67; test scores with medical school averages, .64;
medical school averages with test scores and premedical grades
(multiple correlation), .81. These are very satisfactory results. Other
correlational studies yielded higher simple coefficients in some instances,
but lower in others.
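
The multiple correlation quoted above can be reproduced from the two simple coefficients once the correlation between the two predictors is known. The study as summarized here does not report that intercorrelation; the value of .31 used below is an assumption, chosen because it is the value the standard two-predictor formula requires to yield .81. A minimal sketch (the function name is illustrative):

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)


# Coefficients reported in the text: premedical grades .67, test scores .64.
# ASSUMPTION: predictor intercorrelation of .31 (not given in the text).
print(round(multiple_r(0.67, 0.64, 0.31), 2))  # 0.81
```

The lower the correlation between grades and test scores, the more the second predictor adds. Had the two correlated .60 with each other, the same formula would give a multiple correlation of only about .73.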
When medical students are selected on the basis of test scores as one
criterion, however, it is to be expected that for them the correlations
with grades (predictive validity) will be lower than those reported above,
since the ranges of test scores and course grades have been narrowed.
(See discussions of correlation in Chapters 2 and 5.) In such studies, the
correlations have been .40 and below (6, 22). The low coefficients found
for restricted groups do not necessarily invalidate the tests.
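
The size of this effect can be illustrated with the standard formula for direct selection on the predictor, which assumes a linear relation and equal scatter of grades at every score level. The figure of one half for the ratio of standard deviations is an assumption made for illustration only; the studies cited do not report it, and the function names are hypothetical.

```python
import math


def restricted_r(r_full: float, sd_ratio: float) -> float:
    """Correlation expected in a group selected on the predictor.

    r_full: correlation in the unselected group.
    sd_ratio: SD of predictor scores in the selected group divided by
              the SD in the unselected group (less than 1 when the
              range has been narrowed).
    """
    rk = r_full * sd_ratio
    return rk / math.sqrt(1 - r_full**2 + rk**2)


def corrected_r(r_restricted: float, sd_ratio: float) -> float:
    """Inverse of restricted_r: estimate of the unselected-group correlation."""
    ru = r_restricted / sd_ratio
    return ru / math.sqrt(1 - r_restricted**2 + ru**2)


# ASSUMPTION: selection halves the SD of test scores (sd_ratio = 0.5).
# The .64 is the unselected-group coefficient reported in the text.
print(round(restricted_r(0.64, 0.5), 2))  # 0.38
```

Under that assumption a coefficient of .64 in an unselected group shrinks to about .38 among those admitted, which is in line with the values of .40 and below reported for selected groups.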

Variations in correlation coefficients between test scores and medical
school grades, found in different studies, are not attributable solely to
inadequacies of the medical aptitude tests. Differences among coefficients
also reflect differences in medical school grading standards, quality
of undergraduate preparation (which to some extent can be compensated
for in medical school courses), and personality traits, which tend to
produce inconsistencies between promise and performance. It would be
highly desirable, also, to study the validity of these aptitude tests when
the scores on General Information are omitted and those of the other
three sections are combined. The reason is this: it appears that medical
students and practicing physicians have little interest in general
problems of human and social welfare, strange as that may seem (22).

Expectancy statistics have also provided useful findings regarding test
scores. For example, one study reports that only 1 percent of the
students in the highest decile group failed in medical school, whereas Hs
percent of those in the lowest decile group failed. These findings provide
an argument for admitting all top-decile students, but not for refusing
admission to those in the lowest. Another study reported that the lowest
decile group contributed 25 percent of the failures, that is, two and
one-half times its quota, in proportion to the total group of students.
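
The "quota" comparison is simple arithmetic: a decile holds 10 percent of the students, so a decile that supplies 25 percent of the failures supplies 2.5 times its share. The sketch below builds a small expectancy table from counts. The counts are hypothetical, invented only so that the lowest decile reproduces the 25 percent figure; they are not data from the studies cited.

```python
def expectancy_table(group_sizes, failures):
    """Return one row per score group with failure rate and share of failures.

    group_sizes: number of students in each group (lowest group first).
    failures: number of those students who failed, in the same order.
    """
    total_students = sum(group_sizes)
    total_failures = sum(failures)
    rows = []
    for group, (n, failed) in enumerate(zip(group_sizes, failures), start=1):
        quota = n / total_students            # group's share of all students
        share = failed / total_failures       # group's share of all failures
        rows.append({
            "group": group,
            "failure_rate": failed / n,
            "share_of_failures": share,
            "times_quota": share / quota,
        })
    return rows


# HYPOTHETICAL counts: 100 students in ten deciles, 20 failures in all.
sizes = [10] * 10
failed = [5, 4, 3, 2, 2, 1, 1, 1, 1, 0]
lowest = expectancy_table(sizes, failed)[0]
print(lowest["share_of_failures"], round(lowest["times_quota"], 2))
```

A table of this kind answers the practical question directly: given a score in a particular decile, what proportion of such students have failed in the past?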

Reliabilities of the verbal and science sections of the tests are
approximately .90; for the quantitative section the reported coefficient is .82,
and for that on modern society it is .84.

Although these aptitude tests are not intended to predict effectiveness
in medical practice, which, like other professions, is dependent upon a
complex of factors, the tests appear also to have some value in forecasting
medical students' levels of success in internships. When a group of interns
were rated on a five-point scale by their hospital staffs, the results showed
that the tests have some selective value in identifying students who prove
to be the most satisfactory interns.

With study in professional schools, including medicine, now within 
the prospects of large numbers of students, continued research leading 
to the development of increasingly effective testing instruments is essen- 
tial. It would be desirable to have thorough studies made of such factors 
as effects of coaching and cramming upon the medical-aptitude test 
scores; relationships between test scores and ratings on interest
inventories; relationship of test scores to drop-outs, in the form of expectancy
tables (that is, value of the tests in predicting survival in the
professional school; not in predicting grades alone); correlations between test
scores and interview ratings; role of personality traits in medical school
performance and survival (that is, degree of emotional stability, degree
and type of motivation, introversion-extroversion,
dominance-submission, kinds and strengths of values, and so on). Research on these and other
personality traits would be difficult and time-consuming, but it could
prove significant.


Aptitude in Law 


Tests in this field are aptitude tests in the same sense as are 
those in the field of medicine. Psychologists and others concerned with 
the problem of testing aptitude for the study of law agree that the follow- 
ing abilities are most important: reading rapidly and comprehending 
relatively difficult material, memorizing and accurate recall, reasoning by 
analogy, discriminating between the relevant and the irrelevant in a mass 
of facts, reasoning inductively and deductively, and facility in acquiring 
and using a vocabulary. Legal aptitude tests thus far constructed attempt 
to measure most or all of these abilities in some degree. The device de- 
scribed in the following pages is the one currently required of all appli- 
cants by a large number of law schools in the United States. 

Law School Admission Test. This instrument was introduced
in 1948. The authors state the tests are designed to measure capacity to
read, to understand, and to reason logically with a variety of verbal,
quantitative, and symbolic materials; to measure knowledge of the
mechanics of writing, ability to organize a piece of prose, ability to
improve a badly written passage, and knowledge of basic information in
the fields of humanities, sciences, and social studies. To measure these
abilities, the following subtests are included.


Principles and cases: the examinee judges the relevance of stated
principles to given cases.
Data interpretation: a measure of quantitative reasoning, including the
use of tables, charts, and graphs.
Reading comprehension: passages with general content, followed by
questions based on the content (stated or implied).
Reading comprehension and recall: this differs from the preceding
subtest in that the questions are to be answered from recall of the content of
the passage.
Writing ability: ability to recognize errors in written English, as in
diction, verbosity, grammar; and ability to correct a poorly written brief
passage ("interlinear exercise").
Organization of ideas: sets of statements, each of which is to be classified
as a central, main supporting, or irrelevant idea, or as an illustrative fact.
Figure classification: a nonverbal perception and reasoning test, similar
in principle to those used in nonverbal group tests of intelligence (see
Chapter 15 and Fig. 19.5).


Directions. Each of these problems consists of two groups of figures, labeled 1 
and 2. These are followed by five lettered answer figures. For each problem you 
are to decide what characteristic each of the figures in group 1 has that none of 
the figures in group 2 has. Then select the lettered answer figure that has this 
characteristic. 


(In sample problem 8 you will note that all the figures in group 1 are rectangles
but none of the figures in group 2 is a rectangle. In sample problem 9, all the
figures in group 1 include a dot but none of the figures in group 2 includes a dot.
The figures in group 1 of sample problem 10 are all white figures, but none of the
group 2 figures is white.)


Fig. 19.5. Sample of Figure Classification, from the Law School Admission Test.
Educational Testing Service, by permission.


General background: multiple-choice questions in the fields of the hu- 
manities, physical and biological sciences, and social studies. 


Evaluation. It is essential to note, first, that these tests are
intended to predict quality of scholastic achievement in law schools;
they do not claim to be useful in predicting performance or degree of
success in the professional practice of law.

Reliability coefficients of the LSAT were high when the
Kuder-Richardson formula was used; they were in the low .90s. The tests have
been rather widely studied in regard to prediction of scholastic
achievement. The two most valuable criteria for prediction are grades in
prelaw courses and test scores. Some investigators have found the former
yield higher correlations, and a few report that the latter do so. For
either of these two criteria, in studies of earlier versions of the Law School
Admission Test, the correlation coefficients with first-year marks in law
schools were mainly between .40 and .60. However, when prelaw college
grades and law-aptitude test scores were combined, coefficients of
multiple correlation with law school grades were raised to the high .60s, and
in some instances well into the .70s.
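
As a reminder of what a Kuder-Richardson reliability coefficient involves, the sketch below computes one from right-wrong item scores. The text does not say which of the Kuder-Richardson formulas was applied to the LSAT; formula 20 is assumed here, and the response matrix is hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 1 (right) or 0 (wrong).

    responses: one list per examinee, one 0/1 entry per item.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Proportion passing each item, and the sum of p * q over items.
    p = [sum(person[i] for person in responses) / n_people
         for i in range(n_items)]
    sum_pq = sum(pi * (1 - pi) for pi in p)

    # Variance of total scores (population form, dividing by N).
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# HYPOTHETICAL data: five examinees, four items.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(data), 2))  # 0.8
```

The coefficient rises as the items agree with one another in ranking the examinees, which is why it is read as an index of internal consistency rather than of stability over time.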

Subsequent studies, with later versions, have yielded much the same 
results (29, 33, 45, 59). The correlations obtained with first-year grades, 
over a period of ten years, were as follows: 


Prelaw record: median coefficients for each of the four years ranged from
the middle .30s to low .40s.
Test scores: median coefficients ranged from the high .30s to high .50s.
Median multiple correlation coefficients for the two criteria ranged from
the high .40s to middle .60s.
Modal interval for prelaw record is the .30s.
Modal interval for test scores is the .40s.
Modal multiple-correlation interval is the .50s.

The use of expectancy tables provides additional evidence of the value
of the LSAT. When the data for four law schools were combined, they
showed that:

of those in the highest 4 percent on test scores, 96 in 100 earned average
or better than average first-year grades;
of those in the next 12 percent, 86 in 100 earned average or better than
average grades;
of those in the lowest 4 percent on test scores, only 4 in 100 earned
average or better than average first-year grades;
of those in the next 12 percent, only 14 in 100 achieved average or
better than average grades.*

* Obviously, these data represent the performance of the students ranking
approximately one SD or more above the mean test score and those
approximately one SD or more below the mean.

In spite of the fact that third-year students in law schools are a somewhat
more selected group than first-year students, the test scores correlate
just about as well with averages for the three years as they do with only
first-year averages.

There are several reasons why the coefficients of predictive validity are
not higher than those reported above. First, students admitted to most
law schools are a group selected on the basis of prelaw college grades and
admission-test scores. The result is restriction in range of scores on the
aptitude tests and of grades in law school courses. This situation, as
explained in Chapters 4 and 5, lowers the coefficient.
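
The effect of restriction in range can be sketched numerically. The example below assumes direct selection on the test score alone, a linear relation between test and grades, and equal scatter about the regression line at all score levels; under those assumptions the correlation in the selected group follows from the unrestricted correlation and the ratio of the two standard deviations. The function name `restricted_r` and the figures used are hypothetical.

```python
import math


def restricted_r(r_unrestricted: float, sd_ratio: float) -> float:
    """Correlation expected in a group selected directly on the predictor.

    sd_ratio: predictor SD in the selected group divided by its SD
    among all applicants (a value of 1.0 means no restriction).
    """
    r, u = r_unrestricted, sd_ratio
    return (r * u) / math.sqrt(1.0 - r**2 + (r**2) * (u**2))


# Hypothetical case: validity of .60 among all applicants, and admitted
# students whose test scores have half the spread of the applicant group.
print(round(restricted_r(0.60, 0.50), 2))
```

In this assumed case a test that would correlate .60 with grades among all applicants shows a coefficient of only about .35 among the admitted students.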

A second factor is the disparity among academic and grading standards 
that prevails among colleges that provide prelaw education. Consequently, 
college grades do not offer a consistent basis for correlation with law 
school grades. A third factor contributing to lowered coefficients is varia- 
tion in standards and in grading among the law schools themselves. 
These three considerations are adequate reasons for establishing specific 
rather than general validity. That is, validity should be determined for 
each law school individually with regard to prelaw grades and test scores. 

Tests for the selection of students of law are sufficiently developed to 
warrant their use in conjunction with prelaw course grades. This conclusion
is indicated by the validity coefficients found in some universities,
by the quite significant multiple correlations with prelaw grades and test 
scores combined, and by the ability of the tests to identify a large per- 
centage of candidates who have high promise and of those whose promise 
is low.7


Aptitude for Teaching 


It is extremely difficult to devise tests for the selection of teach- 
ers because their preparation is not so clearly defined as in the case of 
law and medicine, since the contents of professional courses of the same 
name will differ considerably, and since teaching encompasses such a wide 
range of subject matter and educational levels. Of equal importance with 
general mental ability and competence in subject matter to be taught 
are the personality traits essential in successful teaching. As yet, no de- 
finitive psychological studies of those traits are available. From the de- 
scriptions that follow, it will appear that tests of aptitude for teaching 
should, strictly, be included in the chapter on tests of educational achieve- 
ment and of proficiency. They are included here, however, because they 
deal with the selection of personnel for one of the professions. 

The National Teachers Examinations are essentially subject-matter tests
(including problems and applications) to evaluate the adequacy of preparation
of candidates for teaching positions in elementary and secondary
schools. The two major divisions are (1) common examinations and (2)
optional examinations. The first includes sections on professional information,
general information (social studies, literature, fine arts, science
and mathematics), English expression, and nonverbal reasoning (similar
to the nonverbal subtests in measures of general intelligence). The optional
examinations include eleven tests in specialized fields of teaching,
from among which the candidate elects one or more. These eleven fields
include subject-matter specialties (for example, biology, music, English,
mathematics) and general educational levels (for example, elementary
school, early childhood education).

7 The LSAT was preceded by several other law aptitude tests, for example, the
Ferson-Stoddard Law Aptitude Examination (1927), the Iowa Legal Aptitude Test,
by M. Adams et al. (1943-1948). See References at the end of this chapter.

Another set of tests, Teacher Education Examination Program, is much 
the same as the examinations in the NTE. Another device, the Graduate 
Record Examinations Advanced Tests: Education, however, serves a dif- 
ferent purpose. It is intended to evaluate college graduates’ qualifications 
for graduate work in education. For this purpose, the examination con- 
sists of 200 items that sample the several fields of professional study in 
the preparation of undergraduates for teaching (for example, educational 
philosophy and psychology, curriculum and methods). 

Efforts are being made to devise tests that will provide information 
regarding the candidate's knowledge of and insights into nonintellectual 
aspects of the behavior of children and adolescents, the belief being that 
such knowledge and insights increase the probabilities of one's success in 
teaching. Noteworthy here is The Case of Mickey Murphy: A Case Study 
Instrument in Evaluation by W. R. Baller (5). This is an actual case 
record, in great detail, of a 14-year-old boy. Interspersed are 150 ques- 
tions on interpretation of data, conclusions to be drawn, and formulation 
of plans.

Evaluation. In one very important respect, these tests differ 
from those in law and medicine. Whereas the last two are administered 
before professional education is begun, teaching aptitude tests are ad- 
ministered during or after professional education, to evaluate the profes- 
sional knowledge and understanding the candidate has already acquired. 
Although the general and professional information and judgments being 
evaluated are essential to teaching success, authors of these tests recognize 
that knowledge of subject matter to be taught, understanding of teaching 
methods, a sound educational philosophy, and psychological information 
about human behavior do not necessarily indicate one's ability to apply
these in actual teaching. Other aspects of the candidate's personality need 
evaluating if future classroom effectiveness is to be estimated. These in- 
clude motives for teaching, emotional stability, social values, ability to 
communicate, ability to establish rapport, attitude toward and concept 
of one's self. These traits, admittedly, are difficult to assess and would
require, at best, a procedure tantamount to clinical assessment or, at
least, the use of a series of self-rating personality inventories. Probably
the most feasible procedure is to use the tests described above to supplement
the "work sample" method; that is, practice teaching by the candidate,
rated by an experienced critic on a soundly devised rating scale
(see Chapter 21).

8 An earlier device was the Coxe-Orleans Prognosis Test of Teaching Ability (10).

From the technical point of view of test construction, these examinations
stand up well in regard to reliability (where reported), the correlations
being .85 and higher. In the absence of predictive validity (performance
in actual teaching), the criterion has been content validity; for
the tests are based upon the judgments of professionally qualified
individuals.

Although these examinations are not employed in a large percentage
of American school systems in selecting teachers, their use is steadily
increasing as the nature of their contribution, and of their limitations, is
recognized.


Aptitudes in Science and Engineering 


Aptitude in science is not a special talent in the same sense that
musical aptitude, for example, is said to be. Scientific aptitude is the
application of general intellectual capacity to scientific materials and
problems. A test of scientific aptitude, therefore, should be regarded as
a device intended to estimate probability of success in scientific and
engineering occupations, without implying that it measures psychological
functions that are essentially different in form from those required in
other types of mental activity.

An early illustration of this type is the Stanford Scientific Aptitude Test
(1930) which was intended for high school seniors and college students.
Its author stated that the subtests evaluate experimental bent; clarity of
definition; suspended versus snap judgment; reasoning; inconsistencies;
fallacies; induction, deduction, and generalization; caution and thoroughness;
discrimination of values in selecting and arranging experimental
data; accuracy of interpretation; and accuracy of observation. At
present, this test is only of historical interest, having been displaced by
the Scholastic Aptitude Test and others of the same type (see Chapter 16).

The Engineering and Physical Science Aptitude Test (B. V. Moore
et al.) consists of a group of previously developed and standardized tests.
The six subtests are: mathematics (algebra), formulation of scientific relationships
in algebraic terms, physical science information, arithmetical
reasoning, scientific vocabulary, and comprehension of mechanical relationships
and problems (presented in pictorial form). The authors present
statistical details which show that their test has a reasonably high degree
of validity when college grades in introductory engineering subjects are
taken as criteria. The highest validity coefficients were found with grades
in physics and chemistry, whereas the lowest were found with those in
"manufacturing processes" and drafting. This fact suggests that the instrument
tests aptitude in the scientific subjects more than in the applied
engineering subjects of study.

Correlations with grade averages for each of the eight semesters ranged 
from .58 (first semester) to .26 (eighth semester), the decline in coefficients 
being steady from semester to semester, with one minor exception. To 
account for the decline from a coefficient of moderate magnitude to one 
that is very low, the usual influencing factors may be operative. These are 
increasingly homogeneous groups of students, reduction of individual 
differences resulting from earlier handicaps or advantages, increasing pro- 
fessional interest and motivation in students, unreliability of grading, 
etc. But here again is an instance in which expectancy tables would pro- 
vide more information than correlation coefficients and probably would 
indicate that this aptitude test has greater predictive value than the cor- 
relations suggest (see Fig. 19.6). 

The Pre-Engineering Ability Test (Educational Testing Service) has 
only two parts: comprehension of scientific materials and general mathe- 
matical ability. The first part requires reading and answering questions 
on scientific selections, tables, and graphs. The items in mathematics in- 
clude arithmetic, algebra, and geometry (including analytic geometry). 
Although it is maintained by the authors of this test that specific factual 
knowledge is not needed for the comprehension items, there is no doubt 
that familiarity with some of the materials, vocabulary, problems, and 
concepts in the physical sciences is an advantage. 

The Minnesota Engineering Analogies Test was constructed for use 
with candidates for jobs and for admission to graduate schools. As its 
name indicates, in form it is like the Miller Analogies Test. Emphasis, of 
course, is upon mathematical and scientific materials, the content having 
been selected largely from courses taken by all engineering students in 
their first two years of undergraduate study. Correlations with grades and 
faculty ratings in undergraduate and graduate courses varied from .40 to
.60, while correlations with salaries and ratings on the job were in the
neighborhood of .30. The latter coefficient is probably to be explained by
the fact that this analogies test purports to measure only a limited though 
important aspect of the traits required in the successful practice of en- 
gineering, that is, logical reasoning with scientific and mathematical ma- 
terials. 


Another example of the measures in this field is the Advanced Test in
Engineering, a Graduate Record Examination. This, also, is for use with
candidates for admission to graduate study. Each GRE in a special field,
such as engineering, is intended to measure what the candidate has
learned in his field (information, principles, problem solving). It is, thus,
an educational-achievement test that is regarded as having value in
estimating aptitude for graduate study.

Evaluation. The purposes of instruments in this field of test- 
ing are to measure the most important elements of aptitude significant in 
the guidance and selection of pre-engineering students and graduate stu- 
dents. Some of the current tests measure a fairly large variety of elements 
believed to be associated with success in the study of science and engineer- 
ing. Others measure only a few. The tendency now is to reduce the num- 
ber of subtests and to improve these, so that the smaller number have 
as much predictive value as the larger. 

These aptitude tests correlate less well with term grades in engineering 
schools (usually .55 or lower) than they do with tests of general ability 
(.70 or higher). And the latter themselves have as high a predictive value
in engineering as the former. The two types thus have much in common. 
Yet they are not perfectly correlated; nor will each type necessarily be as 
effective as the other in the case of any specific student. Better results, 
therefore, will be obtained if both types are administered, one to supple- 
ment the other, and if, in addition, the reasons are sought for significant 
discrepancies that might be found between the two scores of any candidate.9

The desirability of investigating specific institutional validity is in- 
dicated for these tests, just as in the case of law and medical school tests; 
for the complex problems of intellective and nonintellective behavior are 
encountered in all areas when psychologists deal with learning, school- 
ing, motivation, standards, and so forth. 

Results obtained with current tests of engineering and scientific apti- 
tude can be most helpful when analyzed and interpreted by a qualified 
professional counselor or psychologist. It is not always the total score 
alone that is useful. A profile representing scores on subtests can be even 
more valuable; but it is necessary to go still further in the analysis of an
individual's performance. That is, a detailed analysis of performance on
each type of subtest and item may reveal an individual's strengths and
weaknesses that can then be evaluated in an interview with the subject, in
the light of other available educational and psychological information
regarding him. Although numerical ratings obtained on some engineering
and scientific aptitude tests will be of significance, greater value will be
derived from them in counseling by a qualified person who interprets the
results in the light of his educational, psychological, and [...] insights.
This is a subjective approach in part; but sole reliance should
not be placed upon aptitude-test scores. Furthermore, every type of counseling
and guidance involves a subjective element, insofar as the counselor
interprets, collates, and draws conclusions, as he must.

9 Students should consult the discussions of multifactor scales (Chapter 17), especially
with reference to their tests of space relations, mechanical reasoning, and quantitative
reasoning ability. Also, consult Chapter 16 with reference to tests of general ability for
high school seniors.


Evaluation of Aptitude Tests 


Differences among Tests. Various testing devices have been
described, beginning with those that measure limited sensory perception
and motor response, continuing to those that measure more complex and
more varied functions as in mechanical and clerical performance, and
culminating with those involving even more complex processes intended
to predict capacity for learning the types of materials taught in the arts
and in professional schools of medicine, law, engineering, and education.

It should be clear from the descriptions [...] instruments designated [...]
in varying degrees. On the whole, those measuring the more limited and
[...] and motor, come closer to satisfying [...] designed to measure the
complex mental [...] content of professional courses of [...] are
aptitude tests largely in the sense that they
employ a selected and specialized content, performance on which is intended
to forecast scholastic achievement within a limited and specialized
range of studies.


Reliability and Validity. Most current tests of aptitude have
shown satisfactory reliability. Their validity, so far as performance on the
job is concerned, has not always been adequately demonstrated, though
there is appreciable variation among tests. Nevertheless, in each field
discussed [...] tests is high [...] to learn to warrant their use [...]

[...] tests will depend upon the purpose for
which they are being used. If they are to be used for the guidance of an
individual [...] concern of the counselor, then their degree
of validity is a matter of primary importance; and it is essential that their
results be supplemented with school records, intelligence-test performance,
evidence of interests, and interviews.

If, on the other hand, an aptitude test is being used by a personnel
manager whose primary [...] incidentally concerned [...] interested
in validity, he will evaluate a test in terms of the extent to which
it enables him to select employees who have the best chances of succeeding 
on the job. That is to say, the personnel manager may want to identify, 
in a large group of persons, a smaller group who, on the average, will ex- 
cel the larger group in the trait being tested. This procedure means that 
among those persons selected for employment some will fail, while among 
those rejected some would have succeeded. The purpose of such a pro- 
cedure is to raise the percentage of successful selections. Admissions 
officers of colleges and professional schools can also employ the same 
principle if they are satisfied with a mechanical and impersonal procedure.
For this purpose, tests of even relatively low validity (coefficients
of .30 or .40) have some value when supplemented by expectancy tables.

In this connection, the "selection ratio" (SR) is often used (53). The 
term is defined as the ratio of the number of persons selected to the 
number tested. Thus, if 200 individuals were examined for a specified type 
of employment or for admission to a professional school, for example, and 
if only the highest 100 were to be selected, the ratio would be .50. Whenever
there is a reliably established positive correlation between the test
scores and the criterion of performance, any degree of selectivity will to
some extent yield a group of persons whose promise is greater than that
of the whole group, the degree of superiority depending upon the size
of the test's validity coefficient. For example, assume we have an aptitude
test that correlates .50 with the criterion (for example, level of performance
in a professional school). Assume also that of the candidates for
selection, only one half will be chosen. The SR, therefore, is .50. We might
ask ourselves the following questions, among others. What are the probabilities
that the scholastic performance of those in the selected group
will place them in the upper half of the levels of performance of an unselected
group? In the present instance the chances are 67 in 100.

Any validity coefficient of a test and any selection ratio may be used in
combination to predict the probable average performance of a selected
group at any level of success. Assume that only the upper 50 percent of
an unselected group are to be chosen. What percent of those retained are
likely to prove satisfactory if the validity coefficient is of a certain size?
For example, if the coefficient is .20, the percent satisfactory will probably
be 56; if the coefficient is .30, the percent is 60; if the coefficient is .40, the
percent is 63; if the coefficient is .70, the percent is 75. Tables of relationships
have been calculated for combinations of selection ratios ranging
from .05 to .95 and coefficients of validity from .00 to 1.00. These tables
provide a useful general index for estimating the probable value of a
test of given validity in a given situation.
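
The percentages quoted above can be reproduced, under stated assumptions, from a closed-form result for the normal bivariate distribution. The sketch below assumes that test and criterion are jointly normal, that the upper half of the group is selected (SR = .50), and that "satisfactory" means falling in the upper half of the unselected group on the criterion. Published tables cover other selection ratios and other proportions of satisfactory persons, which this sketch does not attempt. The function names are illustrative only.

```python
import math


def selection_ratio(n_selected: int, n_tested: int) -> float:
    """Ratio of the number of persons selected to the number tested."""
    return n_selected / n_tested


def percent_satisfactory_top_half(validity: float) -> float:
    """Percent of a top-half selected group expected in the criterion's upper half.

    Assumes a bivariate normal relation between test and criterion, for which
    P(criterion above median | test above median) = 1/2 + arcsin(r) / pi.
    """
    return 100.0 * (0.5 + math.asin(validity) / math.pi)


print(selection_ratio(100, 200))
for r in (0.20, 0.30, 0.40, 0.50, 0.70):
    print(f"validity {r:.2f}: {round(percent_satisfactory_top_half(r))} percent")
```

The rounded results (56, 60, 63, 67, and 75 percent) agree with the figures given in the text for coefficients of .20, .30, .40, .50, and .70.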

It is desirable to re-emphasize that this technique yields indexes applicable
to groups, but it does not show what will happen in
the case of a particular individual within that group. The utilization of
the selection ratio does, however, add to the utility of aptitude tests when
group trends and a minimum of elimination from a job or course of
study are the first concerns. When the individual himself is the primary
concern, as he is to clinician and counselor, aptitude-test results should
be used only as one source of information in a total picture.10


TESTS OF EDUCATIONAL ACHIEVEMENT


Nature and Scope 


A test of educational achievement is one designed to measure
knowledge, understanding, or skills in a specified subject or group of subjects.
The test might be restricted to a single subject, such as arithmetic,
[...] yielding a separate score for each subject and a total score for the several
subjects combined.

Tests of educational achievement differ from those of intelligence in
that (1) the former are concerned with the quantity and quality of learning
[...] or group of subjects, after a period of [...] general in scope and are
intended for the [...] psychological processes, although they must [...]
content that resembles the content [...] ability to solve problems, [...]
apply generalizations [...]

[...] knowledge, comprehension, application, analysis, synthesis, and evaluation.
Each of these is analyzed into several aspects. For example, knowledge
is divided into knowledge of specifics, ways and means of dealing
with specifics, universals and abstractions in a field. These, in turn, are
broken down further into knowledge of:

specific information 

terminology 

ways and means of presenting ideas and phenomena 

trends and sequences 

classification and categories 

criteria 

methodology 

major ideas 

principles and generalizations 

theories and structures 
Comprehension is divided into three aspects: translation (ability to 
paraphrase), interpretation, and extrapolation. Analysis is regarded as 
being of three types: analysis of elements, relationships, and organizational
principles. Comparable aspects are given for the other major objectives.1

It is apparent that several of these objectives and their divisions en- 
gage the higher, more complex mental processes, going beyond only the 
acquisition and retention of information; although it is also clear that 
specifics are essential.

Currently, also, emphasis is being placed upon tests that measure edu- 
cational achievement in broad areas, such as the social studies, natural 
sciences, and humanities, rather than in the specific and restricted divisions
of each of these. The trend is particularly marked in the senior year
in high school and at the college level. The newer types of tests are out- 
growths of changes in conceptions of curriculum organization, educa- 
tional objectives, and teaching methods: namely, that courses and units, 
where possible, should be “interdisciplinary”; they should include the 
several relevant areas of study rather than being segmental; and they 
should contribute to the development of complex mental functioning. 

Although educational achievement tests, especially those of the newer 
types, involve some of the psychological processes measured with tests 
of intelligence, they differ from the latter in that they utilize specific materials
from an area of subject matter in which instruction has been provided.
Their purpose is to measure how much has been learned in the
subject and what specific abilities or skills have been developed. The
test of intelligence, on the other hand, is intended to provide indexes
representing an individual's level of general mental [...]
quality. The line of demarcation between the two is not a sharp one;
and the correlation between them is quite significant. But there are clear
distinctions in regard to details of content, emphasis, and purpose.

1 Sample test items are given in a later section of this chapter.

The number of tests of educational achievement is very large. They 
cover almost every subject taught in elementary and secondary schools 
and colleges; and they vary considerably in merit. Some are broad in 
range of school grades and extent of subject matter, while others are 
relatively restricted. Still others are devised not only to measure the 
amount learned but, also, to diagnose difficulties. For example, some tests 
in arithmetic are intended to reveal pupils' weaknesses in each of the four 
fundamental processes or in the several types of skills (for example, fractions,
decimals) [...] reading deficiencies as in rate, vocabulary, and comprehension.

The principles upon which tests of educational achievement are standardized
[...] types already presented; the same [...] validity, and reliability apply
[...] educational achievement should [...]
test items should be carefully phrased and analyzed for their discrimina[...]

Uses 


Tests of educational achievement have been put to a number
of uses in educational institutions at all levels, in psychological clinics,
in business, industry, and government. In elementary and secondary
schools, in particular, they are used to evaluate teachers' effectiveness,
when employed [...] been used, also [...] purposes it is necessary to use
[...] devices; that is, tests that have been [...]
(often referred to as "ability grouping") for purposes of differentiated
instruction. The results of the tests may also be used to provide a basis
or criterion of motivation for an individual, since he can more readily
measure his progress on a standardized scale than by other available 
means. Thus the progress of pupils can be evaluated over a period of 
time, by the pupils themselves and by their teachers. 

Reliable and valid achievement tests, whether constructed by specialists
for widespread use or by a group of teachers for use only in their own
school, are indispensable in the guidance of individual pupils; for they
enable teachers and counselors to diagnose each pupil's strengths and 
weaknesses. Such diagnosis is necessary to plan remedial instruction and 
to assist in the selection of a future course of education to be followed 
by a given individual. 

Selection of pupils and students by institutions for particular types of 
education is the converse of guidance of the individual himself. Accordingly,
many secondary schools, colleges, and universities administer objective
achievement tests as one criterion to be considered in their process
of selecting and eliminating applicants for admission to various types of 
education. Many colleges, for example, require that their applicants take 
the objective tests of the College Entrance Examination Board. Some uni- 
versity graduate schools require applicants to take the Graduate Record 
Examinations. 

The clinical psychologist often uses the results of educational achieve- 
ment tests when dealing with individuals at the elementary, secondary, 
or college level. They are individuals whose adjustment problems are 
associated with deficiencies or inabilities within certain subjects of study. 
Paradoxically, the problems of some pupils are attributable to superior 
aptitude or achievement in certain school subjects which, however, are 
unrecognized and unutilized by their teachers. Results of standardized
tests often disclose such a condition. The clinical psychologist might, also,
utilize achievement-test results as part of a total case history.

Achievement tests are now widely used outside of educational institutions.
State and Federal civil services include objective tests among their
requirements for appointment, as do some businesses and industries for
certain of their positions. One of the requirements for obtaining a license
or a certificate to practice a profession (for example, law, dentistry, nursing,
psychology, medicine) is the passing of a state examination (or national,
in some professions), at least part of which is of the objective type.
Since examination by means of objective tests is widespread and since 
decisions depend in part upon test scores, it is imperative that these devices
be constructed jointly by professional persons of two kinds: (1)
those who are expert in the techniques of test construction, and (2) those
who are specialists in the subject-matter area for which the examination
is to be prepared.


Derived Indexes 


Educational Age and Achievement Age. Scoring of educational
tests does not involve unique problems or principles. Their raw scores
may be converted into the familiar percentile ranks and standard scores,
or into "educational age" or "achievement age," by means of a table of
age norms. Raw scores may be converted, also, into grade equivalents,
using a table of grade norms.

The reader is already familiar with the concept of the norm and with
the method whereby it is derived. Thus, if a pupil's raw score on a standardized
test of the four fundamental arithmetical processes gives him a
grade equivalent of 4-5, it means that his achievement on this test is equal
to that of the average of pupils who have completed half of the fourth
grade. By means of this index, obtained for each of the several subjects of
instruction, it is possible to get a profile for each pupil and thereby to
evaluate his general level of achievement, the evenness or unevenness of
his performance, his weakness and strength in the type of learning measured.


The educational age [...] several subjects in which [...] pupil has been
given a test [...] reading rate and comprehension [...]

Achievement age (AA) designates a pupil's [...] single school subject.
The "subject age" [...] norm of 10-year-old pupils, as
measured by the given test. If "subject age" is used, it would be said
that this pupil's "arithmetic age" is 10. At present, it is a matter of personal
choice whether to designate a pupil's performance on an arithmetic
test as "achievement age (arithmetic)" or "arithmetic age." 2

Norms. Grade norms, educational ages, and achievement ages 
derived for different tests are not always comparable, nor are they always 
applicable in a given school or community, since the standardization 
populations of these tests vary in respect to adequacy and representative- 
ness. This is an important problem in achievement testing, since quality 
of education is far from uniform among various parts of this country. 
Rather than deriving only national norms, it is much more meaningful 
to present, in addition, separate norms for different sections of the coun- 
try, and even for different types of communities (according to popula- 
tion). In connection with norms, it is necessary to recall the earlier dis- 
tinction made between norms and standards (Chapter 6). Norms, in this 
instance, will represent the levels of manifested achievement in the subject 
matter being tested. They do not necessarily represent the level of learn- 
ing that might be desirable or optimal—that is, a standard of per- 
formance. 

Educational Quotient and Achievement Quotient. It is necessary
to have a quotient to accompany an educational or an achievement
age. Hence, there are two types of quotients used with tests of educational 
achievement when EA or AA is found; these are the educational quotient 
(EQ) and the achievement quotient (AQ). The latter is sometimes called 
the accomplishment quotient. 

The educational quotient is the ratio of educational age to chrono- 
logical age (EA/CA) multiplied by 100 to remove the decimal; that is, the 
individual's average level of measured learning in relation to what is 
expected on the basis of his life age. Theoretically, the "normal" EQ is
100; deviations above or below represent, respectively, superior or in- 
ferior school learning, as compared with the individual's age group. 

The achievement quotient is the ratio of educational age to mental 
age (EA/MA), multiplied by 100; that is, the individual's average level 
of measured learning in relation to what is expected on the basis of his 
mental level. Due to marked individual differences in mental ability in
any group of persons of the same life age, the mental age is regarded as a
more reliable index of a person's learning capacity than is chronological
age; therefore, the AQ is a more valuable index than the EQ in judging
whether or not a pupil's school achievement is commensurate with the
kind and quality of learning that might reasonably be expected of
him.
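
The two quotients can be illustrated with a short computation. The following is only a sketch: the function names and the pupil's ages are hypothetical, chosen for illustration, and ages are expressed in months so that years and months can be combined.

```python
def months(years: int, extra_months: int = 0) -> int:
    """Express an age given in years and months as a total number of months."""
    return years * 12 + extra_months


def educational_quotient(ea_months: float, ca_months: float) -> float:
    """EQ = (educational age / chronological age) x 100."""
    return 100.0 * ea_months / ca_months


def achievement_quotient(ea_months: float, ma_months: float) -> float:
    """AQ = (educational age / mental age) x 100."""
    return 100.0 * ea_months / ma_months


# Hypothetical pupil: chronological age 10-0, mental age 12-0, educational age 11-0.
ca = months(10)
ma = months(12)
ea = months(11)

eq = educational_quotient(ea, ca)  # learning relative to life age
aq = achievement_quotient(ea, ma)  # learning relative to mental level

print(f"EQ = {eq:.1f}, AQ = {aq:.1f}")
```

For this hypothetical pupil the EQ is above 100 while the AQ is below 100: his measured learning is ahead of his age group but behind what his mental age would lead one to expect.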


2 At present the term "achievement age" is in disfavor, as is the term "achievement
quotient," defined later in this section.

The educational quotient and the achievement quotient have been
used not only when the composite EA is derived and taken as the nu- 
merator of the ratio, but also when only the age level for a single subject 
(subject age or achievement age) is used as the numerator. When this is 
the case, then the quotients should be read as “educational quotient 
in ..." (naming the school subject tested; for example, arithmetic); 
"achievement quotient in . . ." 

The wisdom of using the AQ has been seriously questioned, for the 
following reasons. (1) Since this index is derived from two separate tests, 
each of which has a certain degree of error or unreliability, it has a lower 
reliability than an index based upon either test alone. (2) Since the norms 
of the two tests, one of educational achievement and the other of general 
mental ability, have in all probability been derived from different stand- 
ardization groups, they will not be strictly comparable, nor will their 
measures of dispersion. For example, unless the distributions of scores of 
the two standardization groups are approximately equal, a given EA (10) 
and EQ (110) will not be comparable with an MA of 10 and an IQ of 110. 
Unless the distributions are equivalent, or nearly so, the two sets of ap- 
parently identical indexes will have different values, more or less. Hence, 
an index derived from them will not have the same meaning as either one. 
(3) Since courses of study in most schools are geared to pupils of average 
ability, bright pupils will not have an opportunity to learn up to a level 
consistent with their capacities; hence, they will be penalized and will get 
AQs of less than 100. Slow pupils, on the other hand, will be working
"over their heads," and will more often get AQs above 100.3 In spite of
these criticisms, however, and assuming that one is aware of its limita- 
tions, there are occasions when the AQ serves a useful purpose, especially 
in studying the educational problems of superior children. 
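
The first criticism can be made concrete with a rough calculation. The sketch below uses the usual first-order approximation for the standard error of a ratio; it assumes that the errors of the two tests are independent, and the ages and standard errors (in months) are hypothetical figures, not values taken from any published test.

```python
import math


def quotient_standard_error(ea: float, se_ea: float, ma: float, se_ma: float) -> float:
    """Approximate standard error of AQ = 100 * EA / MA.

    First-order (delta-method) approximation, assuming the errors of
    measurement in EA and MA are independent of each other.
    """
    relative_error = math.sqrt((se_ea / ea) ** 2 + (se_ma / ma) ** 2)
    return 100.0 * (ea / ma) * relative_error


# Hypothetical case: EA and MA both 120 months, each with a standard error of 6 months.
se_aq = quotient_standard_error(120, 6, 120, 6)

# If MA were known without error, only the error in EA would remain.
se_ea_only = quotient_standard_error(120, 6, 120, 0)

print(f"error in both tests: {se_aq:.2f}; error in EA only: {se_ea_only:.2f}")
```

Under these assumptions the quotient carries a larger error than it would if only one of the two measures were fallible, which is the substance of the criticism.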


Types of Items 


Test items may be classified into one or another of the following
major types: (1) simple recall; (2) two alternatives, such as true-false,
right-wrong, yes-no; (3) multiple-choice; (4) completion; (5) matching;
(6) analogies; and (7) check lists. The following examples illustrate these
types.


(1) Simple recall:

What are the dat ...
What are the two main gases found in water?

3 This criticism does not seem to have much merit; for what it says, in effect, is that
the AQ helps to reveal an undesirable educational situation, so far as superior pupils
are concerned.

(2) Two alternatives:


NaCl is the chemical symbol for common salt. T F
Abraham Lincoln served two complete terms as president of the 
United States. T F 


(3) Multiple-choice: 
Scrooge is a character in 
1. Oliver Twist 
2. David Copperfield 
3. A Christmas Carol.
An example of an appointed official is a 
1. congressman 
2. senator 
3. federal judge. 
(4) Completion: 


The executive head of the United States government is the ________,
while the federal legislative bodies are the ________ and the ________.


(5) Matching: 
Directions: After each name write the number of the topic which is 
intimately associated with that person. 


1. conditioned reflex                     Titchener ........
2. age scale for testing intelligence     Hall ........
3. reaction-time experiments              Pavlov ........
4. psychoanalysis                         Cattell, J. M. ........
5. psychology of adolescence              Freud ........
6. existential psychology                 Binet ........
7. factorial analysis
(6) Analogies: 
Executive functions : President : : legislative functions : ________
Hydrogen : H : : sodium : ________
(7) Check lists: 
Directions: Place a check mark in front of each of the following items
which is part of an automobile:

throttle        generator
rudder          distributor
gear shift      aileron
periscope       stabilizer


Although these are the principal types of items used in tests of edu- 
cational achievement, they are not found with equal frequency; true-false, 
multiple-choice, and completion are the most common. There are also 
variations on some types. For example, tests of paragraph meaning (in
literature, social sciences, physical sciences, and others) are quite common.
The testee reads a paragraph and is then required to answer questions
intended to show the extent to which he has comprehended it. The
questions usually are in the form of true-false, multiple-choice, or completion.
In arithmetic and other aspects of mathematics, the items in a
standardized test usually are in the same form as those devised by the
individual teacher, that is, the student simply provides the correct answer. In
tests of some aspects of English, special types of items have been
developed, as, for example, in tests of punctuation and capitalization. In the


following illustration, the necessary capitals and marks of punctuation 
are to be supplied. 


why did you come home so early mary asked 


For purpose of comparison, the following items are presented to 
illustrate the content of the newer types of test items, although their 
form is the same as that of other multiple-choice items.4


To test knowledge of “conventions”: 
For computation purposes, forces are frequently represented by 
1. straight lines 
2. circles 
3. areas of a circle
4. angles
5. objects of three dimensions


To test knowledge of methodology: 
When a scientist is confronted with a problem, his first step toward
solving it should usually be to
1. construct and purchase equipment
2. perform an experiment
3. draw conclusions
4. use other scientists to cooperate with him in working it out
5. gather all available information on the subject


To test analysis of elements: 


1. Galileo investigated the problem of the acceleration of falling bodies
by rolling balls down a very smooth plane inclined at increasing angles,
since he had no means of determining very short intervals of time. From
the data obtained, he extrapolated for the case of free fall. Which of the
following is an assumption implicit in the extrapolation?
1. That air resistance is negligible in free fall
2. That objects fall with constant acceleration
3. That acceleration observed with the inclined plane is the same
as that involved in free fall
4. That planes are frictionless
5. That a vertical plane and one which is nearly so have nearly the
same effect on the ball

4 From Bloom (3). By permission. Students are advised to consult this volume to
examine in detail the wide range of item types.

To test analysis:

Items 28 and 29 are based on a composition which is played during the 
test. Number 28 calls for analysis of the systematic arrangement or structure 
which makes the composition a unit. Number 29 tests such an objective as 
ability to analyze, in a particular work of art, the relations of materials and 
means of production to the elements and to the organization. 

28. The general structure of the composition is 
1. Theme and variations 
2. Theme, development, restatement 
3. Theme 1, development; theme 2, development 
4. Introduction, theme, development 
29. The theme is carried by 
1. the strings 
2. the woodwinds 
3. the horns 
4. all in turn 


Representative Batteries 


Educational achievement tests are extensive in range and num- 
ber and they vary in merit. Whenever a psychologist or educator has to 
evaluate the educational achievement of a group or of an individual, the 
selection of tests must be based upon their appropriateness to the problem 
at hand and upon the adequacy of their standardization: that is, range of 
grades covered, aspects and comprehensiveness of subject matter covered,
reliability, validity, and population sample from which norms were de- 
rived. In this section, we shall list only several of the sounder batteries as 
representative of the group. When these are compared, it is apparent that 
they have much in common, both as to areas covered and content, al- 
though each has some individuality. 


California Achievement Tests (1957): Forms for grades 1-2, 3-4, 4-6, 7-9, 
9-14. Tests in reading vocabulary, reading comprehension, arithmetical 
fundamentals, arithmetical reasoning, mechanics of English, spelling. 

Iowa Tests of Basic Skills (1956): Grades 3-9. Tests in vocabulary, reading 
comprehension, language, arithmetical skills, work-study skills. 

Metropolitan Achievement Tests (1959): Forms for grades 1, 2, 3-4, 5-6,
7-9. Tests in vocabulary, reading, arithmetic, science, social studies, study
skills.

SRA Achievement Series (1957): Forms for grades 2-4, 4-6, 6-9. Tests in
reading, language perception, language arts, arithmetic, work-study skills.

Sequential Tests of Educational Progress (1958): Referred to as STEP.
Forms for grades 4-6, 7-9, 10-12, college. Tests in reading, writing,
mathematics, science, social studies, listening comprehension, essay writing.

Stanford Achievement Test (1953): Forms for grades 1-3, 3-4, 5-6, 7-9.
Tests in arithmetic, reading, science, study skills, social studies,

Some batteries are constructed for use only at the high school level: for
example, Essential High School Content Battery (1951), Evaluation and
Adjustment Series (1950), Iowa Tests of Educational Development (1952).
These, naturally, place more emphasis upon specific sciences, social 
studies, and upon general educational development in thinking with and 
using materials learned. 

Evaluation. These and other equally sound batteries of tests 
have much in common regarding objectives, standardization population, 
reliability, and conceptions of validation. On the whole, they are
technically well constructed. They have been current long enough, having
gone through revisions, so that their normative data are based upon the
scores of many thousands of individuals. Their reliabilities are high,
mostly in the high .80s and low .90s. Validity of achievement tests is given
primarily in terms of content and constructs. These criteria of selecting
test materials are justifiable. The value of an achievement battery will
depend upon (1) the thoroughness with which the test materials were
selected; (2) adequacy of item analysis to determine the discriminating
power and difficulty level of each (see "Item Analysis," Chapter 5); and
(3) adequacy and differentiating value of norms, as between
chronological age groups and grade levels.
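
The two item statistics named in point (2) can be sketched briefly. The following is an illustration only, not the procedure of any particular battery: difficulty is taken as the proportion of pupils answering the item correctly, and discriminating power as the difference between the proportions correct in the highest-scoring and lowest-scoring groups. The function name, the group fraction, and the response data are all hypothetical.

```python
def item_analysis(responses, fraction=0.27):
    """Return (difficulty, discrimination) for each item.

    responses: one list per pupil, holding 1 (right) or 0 (wrong) per item.
    fraction:  share of pupils placed in the upper and in the lower group.
    """
    n_items = len(responses[0])
    # Rank pupils by total score, highest first.
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]

    results = []
    for i in range(n_items):
        difficulty = sum(r[i] for r in responses) / len(responses)
        p_upper = sum(r[i] for r in upper) / k
        p_lower = sum(r[i] for r in lower) / k
        results.append((difficulty, p_upper - p_lower))
    return results


# Hypothetical data: six pupils, four items.
answers = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
results = item_analysis(answers, fraction=1 / 3)
for number, (difficulty, discrimination) in enumerate(results, start=1):
    print(f"item {number}: difficulty {difficulty:.2f}, discrimination {discrimination:.2f}")
```

In this made-up set, the fourth item is answered as often by the weakest pupils as by the strongest, so its discrimination value is zero; such an item would be a candidate for revision or elimination.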

If an achievement test is valid, it measures what is actually being taught
in the schools for which it is intended. The procedure followed in con- 
structing the sounder achievement tests, therefore, is essentially as follows. 
The widely used textbooks in the subject are thoroughly analyzed; a 
variety of course syllabi are analyzed; educational objectives of the courses 
of instruction are defined; information (content validity), skills and in- 
tellectual processes (construct validity) to be tested are defined; an analysis 
is made of research findings relating to pupils’ experiences, forms and 
levels of concepts, and vocabularies at successive ages and school grades. 
On the basis of these analyses, outlines of test content are prepared, in
... to all the skills, types of information, and
... examined. The outlines of content and pro-
... by subject-matter specialists. After this has
... the test is ready for a series of tryouts, in-
... techniques, during which items are re-
vised, added, and eliminated, until ... and validity are
achieved.


These batteries are intended to be analytical of pupil achievement in
these ways: (1) they measure conti ...

... are so markedly deficient in any area as to require more intensive testing,
observation by their teachers, and diagnosis of the elements of deficiency,
with a view to remedial instruction. The comprehensive survey batteries 
are not designed primarily for this kind of diagnosis, though they con- 
tribute to this end in a limited degree. For example, if a pupil earns a 
low score on the arithmetic test in the battery (for reasons other than 
poor general intelligence), a diagnostic test in arithmetic (either stand- 
ardized or especially devised) should reveal in which of the four funda- 
mental processes, and at which levels, he is deficient; or in which number 
combinations, or specific skills and understandings, he is weak. 

There are several important and complicating considerations that 
must be taken into account when an educational achievement test is being 
constructed and when it is being evaluated for use in a particular school. 
As already stated, the tests are based upon analyses of textbooks, course 
syllabi, and judgments of experts. For each school or school system,
however, the appropriateness of the tests' content must be judged. The
question is: Are the school's courses of study, syllabi, and objectives
comparable? If they are not, the pupils' scores will not be valid indexes of
what has been taught and learned, except for the purpose of comparison
with the schools and the rationale used in the standardization process.5

Most sound achievement batteries provide grade norms and percentile 
norms within each grade. It is highly desirable, also, to provide grade 
norms for the beginning, middle, and end of the school year; for,
obviously, there should be measurable development during the year.
Currently, however, the use and interpretation of norms present difficulties.
There always have been, and there still are, sectional differences in the
quality of schooling provided in the several geographic areas of the
country. “National norms,” therefore, especially when they stand alone,
are of doubtful value; for they might not be representative of any schools
or groups of schools in particular. It is desirable, for this reason, to
calculate separate norms for geographic areas, or even for individual states,
where the population sample has been representative. The meaning of
grade norms is further vitiated by the different policies followed in
promoting pupils into successive grades. Many schools do not use a
standard of achievement ("passing"); pupils are moved along after two years
in a grade, regardless of competence. (In this connection, see Chapter 6
on the difference between norms and standards.) Other schools follow
different practices. Grade norms, therefore, have a different significance 
from place to place. Some test authors meet this problem by providing 


5 A well-conceived and soundly constructed test can stimulate improvements in the
courses of some schools for whose pupils the content ... in
view of what they have been taught.

modal age-grade norms (Stanford Achievement Tests, for example); that
is, the norms are based upon the scores of the most common age group 
in each grade rather than upon the scores of all pupils in the grade.5 The 
modal-age group consists of those pupils who are typical of the grade with 
respect to age; they have been in each grade one year and entered school 
at about the same age. Furthermore, the age requirement for entering 
school is approximately the same throughout the United States. 
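
The difference between an all-pupil norm and a modal-age norm can be shown with a small computation. This is a sketch under stated assumptions: the modal-age group is taken simply as the pupils of the most common age in the grade, the median score is used as the norm, and the ages and scores are hypothetical. Published batteries may define the group and the norm differently.

```python
from collections import Counter
from statistics import median


def modal_age_norm(pupils):
    """Return (modal age, median score of pupils at that age) for one grade.

    pupils: list of (age_in_years, score) pairs.
    """
    modal_age, _count = Counter(age for age, _score in pupils).most_common(1)[0]
    modal_scores = [score for age, score in pupils if age == modal_age]
    return modal_age, median(modal_scores)


# Hypothetical fourth-grade class: (age in years, raw score).
grade4 = [(9, 40), (9, 44), (9, 42), (10, 30), (11, 25), (8, 50)]

modal_age, modal_norm = modal_age_norm(grade4)
all_pupils_norm = median(score for _age, score in grade4)

print(f"modal age {modal_age}: norm {modal_norm}; all pupils: norm {all_pupils_norm}")
```

Here the over-age and under-age pupils pull the all-pupil norm away from the level typical of pupils who have progressed through the grades at the usual rate.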

On the whole, soundly conceived and constructed achievement test 
batteries can contribute much to the study and solution of the educa- 


tional and psychological problems indicated under the section on “Uses” 
in this chapter. 


Reading Tests


Reading, of all school subjects, is the one to which most research 
has been devoted and for which the largest number of tests have been 
devised. This is quite understandable, since ability to read is the founda- 
tion of the usual curricula through high school, and also of most courses 


groups, although the 
diagnostic. 
Achievement, 


In Connection with re

Readiness. Children are ordinarily admitted to the first grade
on the basis of age, the assumption being that when a child reaches a
specified age (usually 6 years), he is ready to begin the study of prescribed 
first-grade subjects. Extensive research on individual differences has re- 
vealed, however, that an appreciable percentage of children are not ready 
for formal instruction in reading at the prescribed time owing to im- 
maturity of some types of perception, although these children are not 
necessarily retarded in general mental development. Tests of reading 
readiness, therefore, are devised to identify children who are not mature 
enough to benefit from instruction. The tests, the elements of which are 
listed below, are representative of those found valuable in evaluating a 
child's status. 


American School Reading Readiness Test (1955). Grade 1.
1. Vocabulary
2. Discrimination of letter forms
3. Discrimination of letter combinations
4. Word selection
5. Word matching
6. Discrimination of geometric forms
7. Following directions
8. Memory of geometric forms
Gates Reading Readiness Test (1939). Grade 1.
1. Following directions in marking pictures (ability to listen,
understand directions, and remember)
2. Word matching (identification of visual word patterns)
3. Word perception (selecting one word from among four)
4. Rhyming (auditory perception)
5. Naming letters and numbers
Harrison-Stroud Reading Readiness Tests (1956). Kindergarten and Grade 1.
1. Making visual discriminations
2. Using context
3. Making auditory discriminations
4. Using auditory clues in identifying items
5. Using symbols
6. Giving names of letters
Metropolitan Readiness Tests (1950). Kindergarten and Grade 1.
1. Word knowledge (selecting pictures that correspond to words)
2. Understanding of and response to oral directions (selecting pictures
in response to sentence-long directions, involving sustained attention)
3. Information (same as 2, but more elaborate, involving more
vocabulary and names of common objects)
4. Matching (visual perception, involving selection of pairs of identical
pictures of common objects)
5. Knowledge of numbers
6. Copying (simple geometric forms and less complex numerals and
capital letters)


Tests of reading readiness are based upon empirical findings, on
analysis of aptitudes involved in children’s learning to read, and on
psychological analysis of the functions involved. Inspection of the component
parts of the foregoing tests shows extensive agreement, in spite of
different names and emphases. On the whole, reading readiness is evaluated
in terms of sensory development and acuity, language development and
interest, curiosity about the environment (information, vocabulary), and,
to a lesser extent, motor control and rate (since learning to write is
collateral with learning to read).

These and comparable tests have demonstrated their value. Their
ratings will indicate whether a child may be expected to experience much,
moderate, or little difficulty in learning to read. The scores of the sounder
tests are significantly correlated (in the .70s) with later achievement in
reading in the lower grades. Analysis of a child's performance on each of
the parts can reveal areas of deficiency and possible sources of
difficulty. Conversely, the tests provide information that will enable
teachers to identify pupils who are equipped to learn without difficulty.

Diagnostic. These tests are intended to isolate in detail the
elements responsible for a pupil's disabilities in reading. The following
instruments, and the lists of their parts, are representative.


Diagnostic Reading Tests (1952). Grades 7-13.
1. Vocabulary: in the fields of English, mathematics, science and social
studies
2. Comprehension: silent and auditory
3. Rate of reading: general, social studies, science
4. Word attack: oral and silent (identification of sounds, syllabication)
Durrell Analysis of Reading Difficulty (1955). Grades 1-6.
1. Oral reading comprehension
2. Oral reading recall
3. Silent reading
4. Word and letter recognition
5. Word pronunciation
6. Spelling
7. Handwriting
Gates Reading Diagnostic Tests (1953). Grades 1-8.
1. Oral reading errors (omissions, reversals, mispronunciations, etc.)
2. Vocabulary
3. Phrase perception
4. Visual perception (syllabication, blending, letter sounds, etc.)
5. Auditory perception
6. Spelling
Gilmore Oral Reading Test (1952). Grades 1-8.
1. Substitutions
2. Mispronunciations
3. Assistance in pronunciation
4. Disregard of punctuation
5. Insertions
6. Hesitations
7. Repetitions
8. Omissions
Group Diagnostic Aptitude and Achievement Test (1939). Intermediate
Form, Grades 3-9.
1. Paragraph meaning
2. Speed of reading
3. Word discrimination: vowels, consonants, reversals, additions,
omissions
4. Letter and form memory (visual)
5. Auditory discrimination
6. Copying text and crossing out letters
7. Vocabulary
Roswell-Chall Diagnostic Reading Test (1956). Grades 2-6.
1. Word recognition and word analysis
a. single consonants and combinations
b. short vowel sounds
c. rule of silent e
d. vowel combinations
e. syllabication


Inspection of the preceding lists shows that the several tests have much 
in common. It will be noted that these and other current diagnostic tests 
in reading are concerned very little, if at all, with detecting visual
anomalies, or with auditory and visual deficiencies that interfere with
learning and with progress in reading. Early tests in this field placed more
emphasis upon these elements; and, it seems, wisely so.

Evaluation. Of the many tests of reading, too large a per- 
centage are inadequately standardized; they do not provide the essential 
data on reliability and validity. These deficiencies are especially serious 
in the case of tests of reading achievement. Although it can be done, it is
much more difficult to determine validity of readiness and diagnostic 
tests, because their effectiveness as predictors has to be estimated in terms 
of either progress at the earliest stages (readiness) or elimination of de- 
ficiencies and handicaps. The degree to which these objectives are 
achieved will depend upon a number of elusive determinants, such as the 
child's motivation and general intelligence, the quality of teaching in the 
regular classroom, the technical skill of the remedial teacher, and the
improvability of a particular defect.

The sounder tests are reasonably reliable, and their validity rests
upon content or construct validity, or both. When they are based upon a
thorough analysis of reading materials at each of the several grade levels,
followed by analysis into the psychological processes involved in reading
at each of the levels, the tests can have considerable merit. The more
analytical and more thorough tests not only require more time to
administer, but they demand greater skill and insights in scoring and
interpreting. It is possible, however, to get useful clues from a brief and less
complex device (for example, the Roswell-Chall), to be followed, if
necessary, by a more extensive and analytical examination.


Arithmetic Tests


Characteristics. Since it is a basic and universal area of in- 
struction, the subject of arithmetic has also been widely studied, and a 
large number of tests have been prepared. Too many of the devices 
in this field, however, fail to satisfy the criteria of sound test construc- 
tion. The most satisfactory available objective tests of arithmetical ability 
are parts of the sounder comprehensive batteries, such as those already 
described.

A valid test in arithmetic is designed to measure the specified objectives 
of the teaching of that subject at each of the grade levels. These objectives 
can be determined only by means of a thorough study of textbooks, syl- 
labi, and research publications. This means, of course, that tests designed 
for one educational level, or a few successive levels, will differ in content 
and emphasis from those devised for higher or lower levels. We find, there- 
fore, that in the aggregate, practically all aspects of arithmetical instruc- 
tion are included in available tests. The following are the understandings 


and operations most frequently included, both in specialized tests and in 
comprehensive batteries. 


1. The four basic processes in a great variety of combinations
2. Manipulation of whole numbers
3. Manipulation of fractions
4. Manipulation of decimals
5. Manipulation of mixed numbers
6. Arithmetical terms and concepts (for example, average)
7. Percentage and interest
8. Measurement
9. Number meaning
10. Arithmetical reasoning (problem solving)



Types. Objective tests in arithmetic are of two major types,
achievement and diagnostic, although, as in reading, they cannot be
mutually exclusive. The achievement type, obviously, is intended to
measure the amount and level of each testee's learning. To accomplish
this end in all respects is extremely difficult; for among the objectives in
modern teaching of the subject are understanding of arithmetical prin- 
ciples and generalizations, and insights into arithmetic as a system—an 
organized and orderly method—of thinking. It is not surprising that most 
of the available devices are concerned almost exclusively with manipula- 
tions of one kind or another and with the solving of the usual types of 
problems, since it is quite difficult to devise test items that will evaluate 
the extent to which the more complex aims have been achieved. Thus, in 
spite of the emphasis, in the last decade and before, upon developing 
meanings, understandings, and insights in teaching arithmetic, few tests 
provide adequately for these functions. Some progress, however, has been 
made. The following are illustrative of items used to test the ability to 
evaluate the nature and sufficiency of data (41). 


Level 4 (Grades 4-6) 


Situation: In Tom's school, some children ride bicycles, some walk to school, 
and some ride the school bus. The pupils on the safety patrol have to come 
early. 

11. Two children from each class in the school were members of the safety 
patrol. To find how many patrol members there are altogether, what 
other fact would you need to know? 

A The number of children in the school
B The number of classes in the school
C The number of children in each class 
D The number of street crossings 


Level 3 (Grades 7-9) 


Situation: Mr. Jones has a dairy farm.
8. Three fourths of Mr. Jones’ cows are Jerseys. Two thirds of Mr. Brown’s
cows are Jerseys. Mr. Jones has fewer cows in his herd than Mr. Brown.
Which of the following statements about the number of Jerseys they
own is true?

E They have the same number. 

F Jones has more than Brown. 

G Brown has more than Jones. 

H More information is needed to find out who has more. 

9. Mr. Jones has two fields of equal size. If he grows corn on 74 of field R
and 8% of field S, which of the following statements is true?
A Field R has more space in corn than field S.

B Field S has more space in corn than field R. 
C Equal space is in corn on both fields. 
D The spaces in corn cannot be compared. 
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The following is a type of item devised to test ability to understand the 
meanings of processes. 


Which of these examples will have the same answer as the example in
the box? (30, p. 147) (Grades 4-5)
14 of 27 


(2) 27 (b) 27 (c) g/s7 (d) 27 
m ts X8 


Attention to the more complex and generalized outcomes of instruc- 
tion in arithmetic should not obscure the fact that mastery of the funda- 
mental arithmetical processes and of techniques of calculation are basic 
and should also be measured both in achievement and diagnostic tests. 

Diagnostic tests in arithmetic must be based upon the same principle
as those in reading. Each must be specialized and detailed in order to
reveal not only areas of weakness and deficiency but also the specific
difficulties within each area. Thus, for example, it is not enough to find
that a pupil is weak in subtraction; it is necessary to know which specific
steps and number combinations are the cause of the difficulty. In working
with decimals, the meaning and placement of the point might be the
source of trouble; or particular number combinations in multiplication
might be troublesome.

When a diagnostic test is given individually, it is desirable for the testee
to work his problems and make his operations orally as well as in written
form, while the examiner records the pupil's responses step by step. This
examination should be followed by an interview. Thus the errors, faulty
methods, and incorrect reasoning may be detected and remedial
instruction specifically directed.

Ability in arithmetic is a function of general intelligence. In trying to
find the cause (or causes) of disabilities in this subject, a test of general
mental ability, preferably an individual test, should be administered to
determine whether the disability is attributable to an inadequate general
mental level or to a specific deficiency that should be readily remediable.


Tests at High School and College Levels 


In the secondary schools themselves, the tests are used to measure prog-
ress and development in subject matter. Among these are the Cooperative 
Achievement Tests (Educational Testing Service), the Evaluation and 
Adjustment Series (Harcourt, Brace & World, Inc.) and the Iowa Tests 
of Educational Development (Science Research Associates, Inc.).8 

The College Entrance Examination Board, as many students already 
know from first-hand experience, has constructed many editions of 
achievement tests, in a variety of secondary school subjects, to be used as 
one criterion of college admission. Some of these measure not only 
mastery of subject matter, but place emphasis upon ability to apply prin- 
ciples in new situations within the subject-matter area. Thus, they are 
intended to serve as measures of what has been learned and as predictors 
of future performance with similar materials in college courses. In addi- 
tion, the College Board has introduced, within relatively recent years, 
Advanced Placement Examinations for high school seniors. These, in each 
subject, are based upon content equivalent to a first course at the college 
level. They thus encourage superior high school students to take ad- 
vanced work and offer these students an opportunity to enter sooner into 
more advanced studies, commensurate with their abilities, at the college 
level. 

Since 1937, objective tests of educational achievement for college gradu- 
ates have been developed. They are the Graduate Record Examinations 
(Educational Testing Service). An increasing number of universities are
using these as one of the criteria for admission to their graduate schools. 
The Examinations consist of three parts: an Aptitude Test, Area Tests, 
and Advanced Tests. The first of these is a test of general mental ability, 
yielding separate scores in verbal and quantitative subtests and intended 
to serve as one predictor of ability to do graduate work. The Area Tests 
(for the college sophomore year and higher) provide examinations in 
social science, humanities, and natural science. They constitute a
general-achievement battery designed to estimate the level and extent of a
student's knowledge and understandings in the major areas of college studies
(liberal arts and sciences). The items include some that evaluate the
student's ability to solve problems in general fields of study and to exercise
judgment based upon familiarity with academic subject matter. The
Advanced Tests are specialized examinations to test students in their subjects
of major study and intended graduate specialization. Data on the predic- 
tive validities of the GRE have shown, as in other comparable situations 
(for example, law and medicine), that a combination of their scores with 
undergraduate grades provides a better predictive criterion than either 
one taken alone (20, 21). 


8 Students should consult Buros (6) for descriptions and evaluations of these series and
of numerous others.
Tests of Complex Educational Objectives


Although, as already explained, some tests of educational
achievement have been including items that go beyond the measurement
of information and rather routine skills, a few have been devised
specifically to estimate the more complex and enduring educational
objectives. One of the earliest of these is the Wrightstone Test of Critical
Thinking in the Social Studies (1939). It included items requiring the
following three types of operations: (1) obtaining facts, (2) drawing
conclusions, (3) applying general facts. The first part requires the pupil
to select certain items from a group of facts in accordance with given
directions. The correct answer is given to each question, the task of the
pupil being to match answers and given facts. In this part of the test,
the pupil is required to obtain facts from graphs, tables, maps, [...] of
a textbook type, and to locate information in books, magazines, and
newspapers in the ways required in a library. This is, in short, a test of
ability to utilize a variety of kinds of materials in order to acquire
information. Part two, called "Drawing Conclusions from Facts," is a reading
test that requires the pupil to evaluate a number of conclusions drawn from
given data. All necessary information is provided, so that recall is not
needed. The procedure consists of reading a paragraph, then matching
given statements with it to determine which of them is appropriately
drawn from the paragraph. Part three, "Applying General Facts," is
much the same as Part two, for the pupil is required to generalize and
apply facts presented in a paragraph to be read.

Although this Wrightstone test is rarely encountered now, it was an
important contribution to educational testing; for its underlying
rationale, different from that of the great majority of tests at the time, has
since been incorporated in some of the more recent achievement batteries
and is receiving increasing attention.

The Watson-Glaser Critical Thinking Appraisal (1942-1956), for grades 
9 through college, is another test of the same general type.9 It consists of
five subtests: (1) drawing inferences, (2) recognition of assumptions, (3) 
making deductions, (4) making interpretations, (5) evaluating arguments. 


Although the content of this test does not call specifically upon
knowledge acquired in c[...] educational objectives: namely, [...]
was developed by the Committee on Appraisal [...]
of the Progressive Education Association, in an eight-year study of
learning at the high school and college levels. One test, on "interpretations of

9 [...] Watson-Glaser Test of Critical Thinking.
data," includes a series of exercises that require the student to formulate 
reasonable generalizations from data drawn largely from the physical and 
the social sciences. "Application of principles of general sciences," an- 
other test, includes a series of exercises requiring the student to explain 
scientific phenomena in terms of relevant facts and principles. There are 
also tests dealing with the application of social science generalizations to 
social problems; others require the drawing of logical conclusions from 
given premises; still others, in the nature of proof, require the student 
to identify basic definitions and assumptions and to judge their plau- 
sibility. The publications of this committee provided impetus to the 
newer directions taken in testing educational achievement. 

It seems appropriate to include tests of study habits, skills, and attitudes 
under the title of this section. There are many of these. They assess such 
matters as motivation for study, attitudes toward academic work, skill 
in using a dictionary and library index, finding sources of information, 
interpreting graphs and tables of data, and organizing a set of lecture 
notes. Since effective levels in all of these aspects are desirable educa- 
tional objectives, it was to be expected that tests for these skills would be 
devised. 


Prognosis Tests in Specific Academic Subjects 


In the fields of mathematics and foreign languages, at the high
school level, several tests, intended to predict achievement in these
subjects, have been constructed. They are aptitude tests in the sense that they
use restricted and specialized types of materials in an effort to predict
learning in a limited area.10 These tests are based upon the
well-recognized principle that a sample of learning in a given subject,
obtained under standardized conditions, provides significant evidence of
future prospects in that subject. Prognosis tests are not to be confused
with achievement tests. The former are intended for pupils who have not
yet studied the subject; the latter are measures of learning after a period
of instruction.

In algebra, for example, two of the well-known instruments are the 
Iowa Algebra Aptitude Test and the Orleans Algebra Prognosis Test. The 
first of these consists of four parts: (1) arithmetical problems involving 
only numerical manipulations, (2) verbal problems using arithmetic and 


10 These are discussed here rather than in Chapter 18 or 19 because they are concerned
with learning a particular form of subject matter, whereas the chapters on aptitude tests
deal primarily with their usefulness in vocational guidance.
simple algebraic procedures, (3) number-series exercises requiring the
examinee to discern the principle used in each series, and (4) exercises
requiring understanding of the effect of one value in an equation produced
by the variation of another value. Prediction by means of this test, thus,
is based almost entirely upon skills and understandings previously
learned.

The Orleans test is different in conception, in that algebraic operations 
have been analyzed and defined (construct validity). It consists of a 
number of “lessons” in algebra, each followed by a test. The lessons deal 
with basic elements involved in the study of elementary algebra: (1) use 
of symbols to represent numbers, (2) substitution of values for symbols, 
(3) representation of quantities by symbols and use of these, (4) expres- 
sion of relationships by means of symbols, and (5) combinations of the 
preceding four in the solution of problems. Prediction by means of this 
test is based upon the acquisition of new understandings and upon “work 
samples." The Orleans Geometry Prognosis Test is constructed upon the 
same principles as the algebra test. 

Similarly conceived prognosis tests have been prepared for foreign-language
study. Symonds' Foreign Language Prognosis Test, a very early
device, attempts to measure this aptitude in "pure" form by presenting
exercises in Esperanto that are mainly applications of principles to word
and sentence translation and to the formation of parts of speech. The
Modern Language Aptitude Test (Carroll and Sapon) is much more
comprehensive. Like the Symonds tests, it consists of practice exercises in
learning several aspects of language, for which it utilizes tape-recorded
and paper-pencil materials. The functions tested are the following, as
described by the test's authors. (1) Number learning: learning numbers
in a foreign language; a test of memory and "auditory alertness." (2)
Phonetic script: ability to learn the correspondence between speech
sounds and symbols; "sound-symbol association ability." (3) Spelling
clues: also "sound-symbol ability"; the scores depend to some extent upon
the examinee's knowledge of English, since he is required to associate
[...] with words. (4) Words in sentences: "sensitivity to grammatical
structure." (5) Paired associates: rote memory exercise in learning
[...] a foreign language (Kurdish) and the English equivalents.


[...] mathematics and foreign [...] The prognosis tests correlate
some[...] with grades in the particular subject for which each is [...]
.60. Learning a foreign language [...] complex of the auditory and visual
senses that are not significantly correlated
with general intelligence. The learning of high school mathematics can be 
impeded, in spite of adequate mental ability, by poor preparation in 
arithmetical understandings and by a negative affective attitude. Where 
feasible, it would be good practice to give prognosis tests in conjunction 
with tests of general intelligence, not only to determine the multiple 
correlation (the two tests combined, correlated with grades), cut-off scores 
and expectancy tables, but—and equally important, if not more so—to 
find the exceptions among the pupils; that is, those with sufficiently high 
intelligence-test ranks but with poor ratings on the prognosis tests. Such 
pupils would be studied individually to determine the causes of the 
discrepancies and to provide remedies. Where this practice is not feasible, 
the ratings on tests of general intelligence may be used as measures of
expectancy in learning. Pupils whose scholastic learning and achieve- 
ment are appreciably inconsistent with these measures can then be given 
special tests, followed by interview, to determine the source of difficulty.11
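
The practice recommended above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the percentile-rank cut-offs, and all figures are hypothetical and do not come from any of the tests discussed. The first function uses the standard formula for the multiple correlation of a criterion (grades) with two predictors, computed from the three pairwise correlations; the second lists pupils whose intelligence-test rank is high but whose prognosis-test rank is low.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_y1 and r_y2 are the correlations of each predictor with the
    criterion; r_12 is the correlation between the two predictors
    (it must not be +1 or -1).
    """
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


def find_exceptions(pupils, intelligence_min, prognosis_max):
    """Names of pupils with a high intelligence rank but a low prognosis rank.

    Each pupil is a tuple: (name, intelligence rank, prognosis rank).
    The two cut-offs are hypothetical and would be set locally.
    """
    return [
        name
        for name, intelligence_rank, prognosis_rank in pupils
        if intelligence_rank >= intelligence_min and prognosis_rank <= prognosis_max
    ]


# Hypothetical correlations: prognosis test with grades, intelligence
# test with grades, and the two tests with each other.
combined = multiple_r(0.60, 0.50, 0.40)

# Hypothetical percentile ranks for three pupils.
pupils = [("A", 90, 20), ("B", 85, 80), ("C", 40, 15)]
to_study = find_exceptions(pupils, intelligence_min=75, prognosis_max=25)
```

With these illustrative figures the combined correlation is somewhat higher than either single correlation, and only pupil A is singled out for individual study.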


Evaluation of Achievement Tests 


The sounder standardized achievement tests, at all levels, are
satisfactory so far as reliability is concerned. Their validity may be
estimated in one or more of three ways: (1) predictive, (2) content, (3) known
groups. In the first instance, test scores are correlated with marks given
by teachers, or with teachers' ratings of their pupils on a scale. These
correlations are of moderate size, because marks and ratings are
subjective. Teachers' marks and ratings are justifiable criteria, however, since
competence and success in learning are ordinarily judged by these.
Validity may also be estimated by the use of groups of known abilities, the
expectancy being that there will be significant differences between them
in test scores. Increases in percent passing each item in successive school
grades and increases in average total score from grade to grade are also
used as validity criteria.

Content validity is a matter of expert judgment. In this instance, the 
question to be answered is: Does the test, in the judgment of specialists, 
measure the stated educational objectives of instruction in a given field 
of study? If it is the stated aim of a course, at any level, to provide
individuals only with information, or verbal skills, or numerical skills,
then a test designed to measure these educational results will be judged 
accordingly. On the other hand, if the purpose of instruction in a given 
subject is to develop ability to evaluate materials, make generalizations,
reason deductively, etc., then a standardized educational test will have to 
be evaluated in these terms. Content validation means, in effect, that 


11 We assume that the tests of intelligence have been properly administered, scored, and
interpreted. This must be true, also, of the tests used for prognosis or diagnosis.




evaluation of the merit of a test rests ultimately upon the judgment of 
teachers who are specialists in the subject and also upon the judgment 
of specialists in test construction. 

In contemporary education in the United States, educational objectives 
emphasize development of concepts and attitudes, critical thinking, analy- 
sis and synthesis of subject matter, creativity, and problem solving, as 
well as the acquisition of essential information and skills. As already 
pointed out, many tests of educational achievement do not attempt to 
measure these abilities; and those that do are by their nature limited in 
the extent to which they are able to measure the attainment of these 
more complex educational objectives. The more complex tests present a 
number of problem situations, for each of which the facts are given, as 
are the possible solutions, inferences, logical explanations, evaluations,
or interpretations. It is the examinee's task only to select certain of these
as being correct, sounder, or more relevant than others. There is very
little, if anything, that is creative, original, or spontaneous in such a
task; the critical thinking required of the examinee is considerably
reduced. These devices test ability, for the most part, to discriminate among
arguments, to recognize warranted and unwarranted assumptions, to
distinguish between the relevant and the irrelevant, etc. If well conceived
and executed, such tests are valuable and are superior to the measurement
of information only; but their limitations must be recognized.12


Tests of Proficiency 


Tests in this category deal principally with achievement in
occupational areas rather than in subjects studied in school. They are
specialized achievement tests.

12 [...]ity, originality, and ingenuity are de[...]

Proficiency tests are available in bookkeeping, shorthand, typewriting,
and for some industrial and mechanical occupations. A battery
of proficiency tests may, however, be built from existing instruments that 
were originally devised for another purpose. For example, analysis of a 
particular type of office job may show that proficiency in four areas is 
necessary: arithmetical computation, spelling, perceptual speed, and 
memory span (immediate recall). As a first step, it might be possible, in
this instance, to utilize, in combination, four existing tests already stand- 
ardized. But specific validation studies for this particular situation would 
be necessary. On the other hand, the nature of the job might be such 
as to require the development of new tests adapted to this specific situa- 
tion. 

Another type is the work sample, which may be in miniature or a 
simulation of the actual task; or it may be a full-scale realistic perform- 
ance. This type requires the candidate to produce a piece of work or 
perform a task as actually required by the job to be filled. Examples are 
performance on a replica of a telephone switchboard, operation of an 
airplane pilot trainer (on the ground), machine operation, typewriting, 
taking and transcribing shorthand. In work-sample testing, considerable 
interest has been shown in business education and business itself because, 
among other reasons, there are aspects of proficiency common to all types 
of jobs in typewriting, stenography, bookkeeping, and, more recently, 
business-machine operating. 

The Seashore-Bennett Stenographic Proficiency Test, used for selecting, 
training, and upgrading employees, is a work-sample test, consisting of 
five letters dictated at three different rates. The testee then transcribes 
the letters in a form that might be mailed out. The final product is scored 
for (1) neatness and cleanness of typing; (2) arrangement of the letter; (3) 
quality of stroke; (4) typing errors and erasures; (5) errors in English; (6) 
changes from the original in wording and meaning. The letters to be 
dictated and transcribed are on records, thus keeping the quality and 
rate of dictation constant for all testees. Samples of superior and poor
transcriptions are provided as guides for scoring. 

To have any merit beyond that of a randomly selected task chosen 
by each employer individually, work samples must be carefully selected 
to be representative of the job, and scoring must be placed on a clearly 
defined basis, with subjectivity minimized. To have a fair approximation 
to uniformity of rating procedure, appropriate check lists and rating 
scales are used (see Chapter 21).

A third type is the test of analogous functions. Instead of producing a
sample or working in a situation that is a replica of the job, the
candidate is tested in functions analogous to those necessary in the job itself.
For example, if manual speed and precision are required, rate of tapping,
filling pegboards, and tracing with a stylus might be used as tests. In
another situation, two-hand coordination, reaction time, and color
discrimination might be measured.

On the whole, the merit of proficiency tests is that, in spite of lack of
normative data and adequate validation for widespread use (which is
true of many), they have been more carefully and analytically prepared
than the usual tests devised by individual teachers, employment
personnel, and employers. As in other types of testing, differential and
specific (rather than general) validity of proficiency tests is necessary.
PERSONALITY RATING METHODS


Definition of Personality 


This term has been variously defined because personalities are
complex and inclusive of all traits; hence, there is much room for
differences in comprehensiveness of the definition. Those definitions,
however, that include only the individual's social value to other members of
his group (that is, more or less superficial attractiveness or reputation)
are inadequate because they are concerned only with overt behavior,
while they ignore the inner aspects of the personality: the motivation,
perceptions, feelings, reactions, attitudes, values, prejudices that are the
basis of one's behavior. These social-value definitions are concerned only
with what a person does and the impression made by him upon others
in his social groups. Such impressions and evaluations are, of course,
important in an individual's life. They are evaluated by means of rating
scales, which will be discussed in this chapter. These definitions, however,
do not take into account the basic traits of a person, aside from what he
actually does. His overt behavior may or may not reveal the covert
aspects of his personality. Some parts of the personality inventories, as
distinguished from rating scales, are intended to identify these covert
aspects in order to provide the psychologist with a basis for a fuller
understanding of an individual's behavior. Whether they succeed in doing so
remains to be discussed. Also, projective tests, more than any other
instrument, are intended to reveal the covert, subtler aspects of
personality and behavior. These methods will be dealt with in later
chapters.

The definition that commends itself is as follows: A personality is the
product of the dynamic and characteristic organization within the
individual of psychobiological structures, or systems, and their interaction
with the environment. It is these two aspects (individuality of the
structured organism and the nature of his environment) that determine an
individual's particular adjustments to his surroundings. A personality
is the individuality that emerges from interaction between a
psychobiological organism and the world in which he has developed and [...]

Personality is described in terms of an individual's behavior: his
actions, postures, words, and attitudes and opinions regarding his ex- 
ternal world. But personality may be more basically described in terms 
of the individual's covert feelings about his external world; feelings that 
may not be apparent or discernible in his overt behavior. It is described 
also in terms of one's feelings about oneself. 

One's actual feelings about his external world and oneself may be at 
the conscious, preconscious, or unconscious level. The same is true
regarding the consciousness levels of the reasons for these same feelings.
In other words, a person may know why he feels as he does (conscious); 
or the reasons for his feelings may be somewhat below (figuratively) the 
level of awareness, but they can come into awareness with relatively little 
effort under appropriate stimulation (preconscious). Or the reasons may 
be so deeply submerged or blurred that they can be brought to the level 
of awareness only with difficulty, if at all (unconscious). This being the 
case then, it is desirable to have tests of personality that can probe the 
various aspects of personality and the levels of awareness as well. 

Several aspects of the foregoing definition need more explanation be- 
fore the instruments themselves are presented. 

By dynamic organization, [...] do not exist independently or [...]
acting in an organized and c[...] or self-confidence, or ascendency, etc.,
than [...] evaluate the person as a whole. The reader [...]
available inventories and rating scales actually
deal with only smaller or larger segments of the personality, a number
of which may be portrayed on a psychological profile; but to be most
meaningful, these segments must somehow be organized into a whole
by the psychologist who studies the individual case.

1 For excellent discussions of these aspects of personality, see G. W. Allport (2, Chaps. 2
and 3); and G. Murphy (11, Chap. 1).

The term psychobiological structures connotes motives, habits, traits, 
attitudes, feelings, values, ways of thinking and acting. The word “psy- 
chobiological" is used to indicate that personality and its component 
integrals are neither exclusively mental nor exclusively biological. Rather, 
they involve psychological processes and functioning, together with their 
biological correlates. 

Interaction with the environment is made explicit in order to emphasize
that an individual's personality does not merely grow from within.
It is the product of the interaction between himself as a developing
organism having certain psychological and biological needs, on the one
hand, and, on the other, his environment that has nurtured, influenced, 
directed, satisfied, or in varying degrees failed to satisfy those needs. 


Rating Scales 


Purposes. This type of scale is useful chiefly for learning what
impression an individual has made on persons with whom he has come
in contact, in respect to some specified traits or attitudes. It is a device
that rates social value, occupational efficiency, group status, and the
like, in certain specified areas; it reflects the impression the subject has
made upon the persons who do the rating. The rating of one person by
others is among the oldest of practices, the present psychological tools
being refinements of the common practice of providing letters and oral
recommendations.

For the rating of an individual, rating scales are submitted to
teachers, counselors, employers, colleagues, parents, and others who have
had sufficient contact with the person in question to have formed an
opinion based upon evidence derived from observation. Usually, of
course, ratings of a particular person are obtained from more than one
judge; for validity of ratings is thereby increased, inasmuch as
subjectivity of judgment is decreased through the balancing of errors and bias.

Rating scales may be devised for a variety of traits, such as tact,
generosity, [...]tiveness, resourcefulness, punctuality, industriousness,
and many others, the number of possibilities being virtually unlimited.
Each scale usually includes [...] traits to be [...], the specific
ones depending upon the purposes for which the scale is intended.

The terms used to designate each of the several traits being rated are
often vague and may have different meanings for different judges. In
order to minimize this problem of semantics and to make the rating 
scales more useful than they would otherwise be, it is necessary to ob- 
serve certain established principles. The following are the major aspects 
to be considered in their construction and use. 

Characteristics. Each trait should be clearly defined. This re- 
quirement is essential so that traits may be clearly and uniformly under- 
stood by all judges. This end may be achieved by giving explanations, 
synonyms, or specific instances as behavioral illustrations. 

The degree of the trait should be defined. Each trait is rated on a 
scale, most frequently of five or seven intervals. A larger number of 
intervals would require refinements of distinctions which are not often 
possible. Each step on the scale must be clarified in much the same way 
as the trait definitions themselves. Examples are given in a later section, 
under “Types.” 

Reliability depends upon extent of variation of judges’ ratings. Judges 
rating an individual on a specified trait will not always agree as to his 
score or rank. It is customary, therefore, to take the median or the mean 
of all the judgments as representing the nearest approximation to the
true rating. If this method of averaging is to be meaningful, however, 
it is essential that the variation of the judges’ ratings shall be small, thus 
indicating reasonably close agreement. A large variation, on the other 
hand, would indicate unreliability owing to lack of clarity regarding the 
trait being evaluated, contradictory or unstable behavior of the subject, 
or undependability of some judges. An average of the judges' ratings 
without regard to their variation might be misleading or even absurd. 
For example, if, on a seven-point scale, two judges rated an individual 
at —3, two at +3, and two at zero, the mean rating would be zero (or 
average level), whereas the probability is that he is not average at all, 
in view of the wide disparity of judgments. Reliability of ratings is
usually dependent upon having a sufficient number of qualified judges, five
to seven being the number frequently recommended, but infrequently
obtained. The degree of agreement required before a set of judgments
can be regarded as reliable is, to some extent, arbitrarily
determined; but to be statistically reliable, agreement among judges
should be three or four times as great, at least, as that obtained by
chance. (See Chapter 2, on errors of measurement and of the correlation
coefficient.)
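
The case of the six divided judges can be worked through in Python as a rough illustration. The function name and the cut-off for an acceptable spread are assumptions made for this sketch only; the text itself notes that the required degree of agreement is, to some extent, arbitrary. The standard deviation of the ratings is used here as one convenient index of variation among judges.

```python
import statistics


def summarize_ratings(ratings, max_spread=1.0):
    """Average one subject's ratings on one trait and report their variation.

    max_spread is a hypothetical cut-off: a standard deviation above it
    is treated as too little agreement for the average to be meaningful.
    """
    spread = statistics.pstdev(ratings)
    return {
        "mean": statistics.mean(ratings),
        "median": statistics.median(ratings),
        "spread": spread,
        "agreement": spread <= max_spread,
    }


# The example from the text: two judges at -3, two at +3, two at zero.
divided = summarize_ratings([-3, -3, 3, 3, 0, 0])

# A hypothetical set of ratings with the same mean but close agreement.
agreed = summarize_ratings([0, 0, 1, -1, 0, 0])
```

Both sets of ratings average zero, but only the second shows the close agreement that makes the average a usable estimate.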


It is possible to discern, by inspection of the ratings of all
judges, which individual judges seem to be most dependable. This is done
by determining the extent to which each trait rating, of each judge,
approximates the average of all ratings on that trait. Similarly, it is
possible to identify the subjects who have been most reliably rated, in terms of
extent of agreement among the judges. Thus, a rating scale can be useful 
in some individual cases, whereas the evidence for a group as a whole 
might be less than satisfactory. 
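
A minimal sketch of this inspection, in Python, is given below. The judges, traits, and ratings are invented for illustration, and the index used (each judge's mean absolute deviation from the trait averages) is only one plausible way of expressing how closely a judge "approximates the average of all ratings."

```python
def judge_deviations(ratings):
    """Mean absolute deviation of each judge from the trait averages.

    ratings maps judge -> {trait: rating} for a single subject; every
    judge is assumed to have rated the same traits.
    """
    judges = list(ratings)
    traits = list(ratings[judges[0]])
    # Average of all judges' ratings on each trait.
    trait_average = {
        trait: sum(ratings[judge][trait] for judge in judges) / len(judges)
        for trait in traits
    }
    # How far, on the average, each judge departs from those averages.
    return {
        judge: sum(abs(ratings[judge][t] - trait_average[t]) for t in traits) / len(traits)
        for judge in judges
    }


# Hypothetical ratings of one subject on a five-point scale.
ratings = {
    "Judge 1": {"tact": 4, "punctuality": 5},
    "Judge 2": {"tact": 4, "punctuality": 4},
    "Judge 3": {"tact": 1, "punctuality": 3},
}
deviations = judge_deviations(ratings)
most_dependable = min(deviations, key=deviations.get)
```

The same computation, applied across judges for each subject rather than across traits for each judge, would indicate which subjects have been rated with the closest agreement.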

Methods of studying reliability of rating scales most commonly include 
the following: repeating judgments after a time interval; correlation be- 
tween ratings of two or more judges; and relationship between judges' 
ratings and self-ratings. The correlation coefficients thus found are near
.50 and .60, much lower than would be acceptable in the case of tests of
general intelligence, specific aptitudes, or educational achievement.
Occasionally, however, much higher reliability coefficients have been obtained,
some being approximately .85.
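
As an illustration of the second method, the correlation between the ratings of two judges can be computed as follows. The ratings are hypothetical, and the product-moment (Pearson) coefficient is assumed here as the index of agreement; the text does not specify which coefficient was used in the studies cited.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of ratings."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical ratings of the same five persons by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
r = pearson_r(judge_a, judge_b)
```

The same function applies to the other two methods: repeated judgments by one judge after a time interval, or judges' ratings set against self-ratings.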

The relatively low reliabilities are not necessarily attributable to de- 
fects in the conception of a scale or in the phrasing of the characteristics 
to be rated. The coefficients reflect, to a considerable extent, the differ- 
ences among the judges and their unreliability. 

It has already been stated that the traits to be rated and the intervals
on the scale should be clearly defined. In addition, it is necessary to
instruct respondents with regard to other aspects affecting the reliability
of their judgments; that is, the rating of each trait should be made
independently of other ratings of the same person (avoidance of the "halo"
effect); ratings should be based upon adequate acquaintance with the
subject; experience and acquaintance with a broad enough variety of
persons to provide bases of judgment are desirable; sincere motivation
to provide the most reliable ratings possible is necessary. Of these, the
"halo" effect has been shown to be among the most serious causes of
unreliability. One tends to overrate a person in all respects if he likes him,
or if he is a close acquaintance.

Determination of validity is difficult. There are few, if any, useful
criteria of validity, in the usual sense, that can be employed in evaluating
a rating scale, because there are only a few measured or measurable
behaviors available with which these scales can be compared. The validity
of a rating scale is assumed, in actual practice, to rest upon the judges'
understanding of the meanings of the traits being evaluated and upon
their accuracy in rating them. The principal indication of validity of
some rating scales is the fact that persons using them (guidance
counselors, personnel officers, employers) find them helpful if the judges are
carefully selected and if the ratings are conscientiously made. This last
condition cannot always be taken for granted. A rater might be unwilling
to devote the time and care necessary for a careful appraisal. It happens
frequently that raters find the task a burdensome one and, consequently,
respond with superficial evaluations. The respondent might, consciously or
not, identify with the person being rated. Or the judge
might want to help this person; or, at least, not injure him. Also,
ratings may be in part determined by idiosyncrasies, prejudices, and
folklore. We still hear and read about the "strong jaw," the "honest face,"
the "penetrating eye," and the "intelligent brow."


Overt traits are more reliably rated than covert traits. Traits that can 
be evaluated upon the basis of objective activities, upon actual past or 
present behaviors known to the judges, are more reliably rated than 
covert traits. For example, emotional expression, social acceptability, 
manifest fear and anxiety, aggressive or impulsive acts are rated with 
greater reliability than those dealing with a person's inner life and feel- 
ings about one's self. Although these ratings can be of significance in 
learning about the subject's status, they are not to be taken at face value 
so far as basic personality structure and dynamics are concerned. Overt 
behavior can be misleading. Aggression can be an expression of feelings
of insecurity; excessive display may be evidence of strong feelings of
inferiority; a high degree of conformity and agreeableness is, in some
instances, symptomatic of deep-seated anxiety.

Degree of certainty of ratings should be stated. With each rating, it is
desirable to have the respondent state his degree of certainty (that is,
very strong, strong, moderate).* It has been demonstrated that judges are
most confident and in closer agreement on ratings at the extremes. This
is understandable because extreme deviants are most clearly
distinguishable from others and are most readily characterized by the
trait names. Thus, such terms as "cooperative-uncooperative" and
"introverted-extroverted" apply most forcibly and clearly to individuals
who are manifestly at one pole or the other.

Some persons are more accurately rated than others. On the whole, 
extroverted individuals are more reliably judged than introverted. The 
ratings of persons whose traits are characterized by overt behavior, rather 
than by covertness and inner qualities, will be based upon fuller, more 
representative, and more clearly perceived behavior samples. Also, it has 
been found that judges rate more reliably those persons who most
resemble themselves, because, it is believed, one can best empathize with
persons whose behavior resembles his own.

Reliability of trait estimates is affected by desirability or undesirability
of the trait. In self-ratings there is a tendency for individuals to
overrate themselves in respect to traits regarded as socially desirable. In
rating other persons, especially friends, some judges may be similarly
influenced, even while trying to be conscientious. In fact, there is a
general tendency toward generosity in ratings, rather than the reverse.


* A similar practice is now often used in polling public opinion.

Types of Rating Scales. The two most common forms are the
scoring and the ranking types. When the first of these is used, the subject
is rated at a point, or level, on the scale, without direct reference to or
comparison with other persons in his group (for example, classroom,
fellow workers). Each point or level on the scale carries a specified score.
On a scale of five, for example, the "average" person is scored 0; the
deviants are scored +2, +1, -1, or -2. Or the scores, to eliminate signs,
can all be positive, 1 being the lowest, 3 the average, 5 the highest.

A common variant of the scoring method is the graphic rating scale. 
The several levels, or degrees, of the trait are defined and placed at points 
along a horizontal line. The judge places a mark anywhere he chooses 
on this line, between the two extremes. Although a graphic scale theoret- 
ically permits scoring at a large number of points, such refinement and 
spurious accuracy are not warranted. The investigator or compiler of 
information will, therefore, convert each rating within a given range 
according to a predetermined numerical scheme. 
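
The conversion of a graphic mark into a score can be sketched as follows. This is only an illustration of the idea of a "predetermined numerical scheme": the line length of 100 units, the five equal ranges, and the function name `graphic_to_score` are assumptions made here and are not taken from any published scale.

```python
def graphic_to_score(mark, line_length=100.0, levels=5):
    """Convert the position of a judge's mark into a whole-number score.

    mark        -- distance of the mark from the left end of the line
    line_length -- total length of the line (hypothetical units)
    levels      -- number of equal ranges into which the line is divided
    """
    if not 0 <= mark <= line_length:
        raise ValueError("the mark must lie on the line")
    width = line_length / levels
    # A mark exactly at the right-hand end belongs to the highest range.
    return min(int(mark // width) + 1, levels)


if __name__ == "__main__":
    for position in (3.0, 47.5, 100.0):
        print(position, "->", graphic_to_score(position))
```

Marks falling anywhere within the same range receive the same score, which is the sense in which the finer distinctions of the graphic line are deliberately discarded.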

The following are examples of the numerical rating scale.

How emotional is the parent's behavior where the child is concerned?
____ Constantly gives vent to unbridled emotion in response to child's
     behavior.
____ Controlled largely by emotion rather than by reason in dealing
     with child.
____ Emotion freely expressed, but actual practice is seldom
     disorganized.
____ Usually maintains calm, objective behavior toward child, even in
     face of trying situations.
____ Never shows any sign of disorganization toward child.

Does he get others to do what he wants done?
____ Displays marked ability to lead his fellows.
____ Sometimes leads in important affairs.
____ Sometimes leads in minor affairs.
____ Lets others take the lead.
____ Probably unable to lead his fellows.
____ No opportunity to observe.


Each rater checks what he believes to be the correct description. The
investigator then will convert the check marks into scores. In the first
instance, the third item would be the average and would be scored zero;
the first and second would be -2 and -1, respectively; the fourth and
fifth, +1 and +2, respectively.
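
The conversion of check marks into scores may be sketched in a few lines. The function name `score_check` and the use of `None` for an unrated trait (for instance, "No opportunity to observe") are assumptions introduced for illustration; the signed and all-positive schemes are the two described in the text.

```python
def score_check(position, n_levels=5, signed=True):
    """Score a check mark by the position of the description checked.

    position -- 1 for the first description listed, 2 for the second, etc.;
                None when the rater had no opportunity to observe
    signed   -- True for scores centered on zero (-2 .. +2 on a scale of
                five), False for all-positive scores (1 .. 5)
    """
    if position is None:
        return None
    if not 1 <= position <= n_levels:
        raise ValueError("position must be between 1 and n_levels")
    if signed:
        if n_levels % 2 == 0:
            raise ValueError("a zero midpoint needs an odd number of levels")
        midpoint = (n_levels + 1) // 2
        return position - midpoint
    return position


if __name__ == "__main__":
    for checked in (1, 3, 5, None):
        print(checked, score_check(checked), score_check(checked, signed=False))
```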

The graphic type of item may be illustrated by the following,
remembering the necessity of definition of the trait and clarification of
levels.

Quality of work

|----------------|----------------|----------------|----------------|
Of doubtful      Not quite up     Satisfactory.    Superior to      Exceptionally
satisfaction.    to standard.                      general run.     high.

Attitude toward others

|----------------|----------------|----------------|----------------|
Quarrelsome,     At times         Ordinarily       Always           [...]
uncooperative,   difficult        tactful,         congenial and    cooperation
upsets morale.   to work          cooperative,     cooperative.     and group
                 with.            and self-                         morale.
                                  controlled.


The ranking scale is used with persons who are members of a
single group and who are to be rated relative to one another. The judge
arranges the names in serial order with regard to each one's standing in
a specified trait. Usually, the judge is instructed first to select the
individuals to be ranked highest, lowest, and average, and then to place
the others in relation to these three. Since intervals between ranked
individuals are not equal, and since it is impossible by this method to
determine the sizes of the intervals, arithmetical and statistical
computations are not warranted and should not be attempted. The ranking
scale simply provides a method for use with a single group of subjects
when, for any valid reason, intragroup comparisons are desired.

Another method, frequently used in rating high school and college
students, is to place each individual according to his percentile or decile
(sometimes quartile or quintile) position in his group.* Thus, a high
school senior applying for college admission, or a college senior applying
for admission to a graduate school, might be rated according to the
following scheme.

Ranks
____ in the highest 5 percent of his group.
____ in the second highest 5 percent of his group.
____ in the highest 25 percent, but not in the top 10 percent.
____ in the third quartile group.
____ in the second quartile group.
____ in the first (lowest) quartile group.
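
As a rough sketch, the placement of a student in this scheme can be derived from his rank in the group. The function name `placement`, the convention that rank 1 is the highest standing, and the treatment of boundary cases (a student exactly at a boundary is placed in the higher band) are assumptions made here; the six bands are those listed in the scheme.

```python
# Upper limit of each band, as a cumulative percentage counted from the top.
BANDS = [
    (5, "highest 5 percent"),
    (10, "second highest 5 percent"),
    (25, "highest 25 percent, but not in the top 10 percent"),
    (50, "third quartile group"),
    (75, "second quartile group"),
    (100, "first (lowest) quartile group"),
]


def placement(rank, group_size):
    """Return the band for a student ranked `rank` (1 = highest) in a group."""
    if not 1 <= rank <= group_size:
        raise ValueError("rank must lie between 1 and the group size")
    for upper_percent, label in BANDS:
        # Integer comparison avoids rounding trouble at the boundaries.
        if rank * 100 <= upper_percent * group_size:
            return label
    raise AssertionError("unreachable: the last band covers every rank")


if __name__ == "__main__":
    for r in (3, 8, 20, 40, 70, 100):
        print(r, "of 100:", placement(r, 100))
```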


* See Chapters 2 and 6 for explanations of these indexes.

When it is desirable to learn whether certain specified traits or
behaviors are present or absent, a check list is used. This consists simply
of a number of statements; each one that applies to the person being
rated is checked. At times, such a list may be scored +1 for a favorable
item, -1 for an unfavorable item, and 0 for a neutral one. Scores of
check lists, however, have little meaning, except as a gross indication of
a trend. Anyone using these devices will find more information and
value in analyzing and interrelating the items that have been checked
(as present) and those left unchecked (as absent). The Vineland Social
Maturity Scale (see below) is a carefully constructed and unusual
example of a check list in that it is a standardized scale. The usual list
takes the following form.

Handles others well; gets cooperation. 

Gives evidence of sound decisions. 

Usually well-balanced emotionally. 

Cooperates willingly when others direct. 

Satisfactory standards can be relied on. 
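
A check list of this kind might be summarized as in the sketch below, which returns the gross score together with the items checked and those left unchecked, since the latter are the more informative. The first two statements are taken from the list above; the last two statements, all of the +1, -1, and 0 values, and the function name `summarize` are hypothetical.

```python
# Value of each statement: +1 favorable, -1 unfavorable, 0 neutral.
# The values, and the last two statements, are hypothetical.
CHECK_LIST = {
    "Handles others well; gets cooperation.": 1,
    "Gives evidence of sound decisions.": 1,
    "Is easily discouraged by obstacles.": -1,
    "Prefers to work alone.": 0,
}


def summarize(checked):
    """Return (gross score, statements checked, statements left unchecked)."""
    unknown = set(checked) - set(CHECK_LIST)
    if unknown:
        raise ValueError(f"statements not on the list: {sorted(unknown)}")
    present = [s for s in CHECK_LIST if s in checked]
    absent = [s for s in CHECK_LIST if s not in checked]
    score = sum(CHECK_LIST[s] for s in present)
    return score, present, absent


if __name__ == "__main__":
    score, present, absent = summarize({
        "Handles others well; gets cooperation.",
        "Is easily discouraged by obstacles.",
    })
    print("gross score:", score)
    print("present:", present)
    print("absent:", absent)
```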


A more discriminating type of check list uses statements, such as those 
above, but provides the respondent with an opportunity to indicate the 
degree to which he believes each behavior is manifested by the subject. 
This is done simply by adding to each item a series of qualifying terms, 
such as: always, usually, occasionally, rarely, never.

FORCED-CHOICE ITEMS. Each item of this type consists of two or more
statements or attributes (there may be two, three, or four). The respondent
is asked to indicate which one or two of the attributes, in each set, is
most descriptive of or appropriate to the person being rated. In each
set, of one form, all attributes may be either desirable or undesirable in
the situation or occupation for which the subject is being evaluated.
Although all statements in each set are descriptive of desirable or
undesirable characteristics, they do not have equal discriminative value;
that is, they are not equally valid in helping to identify desirable or
undesirable personalities for the particular job or educational program
under consideration, though they might appear so to the rater. This fact
is one of the advantages of the forced-choice method; for it reduces the
possible effects of a favorable bias. The following four statements might
constitute one item from which the judge is to select two as being most
appropriate to the individual being evaluated.


Is well informed in his science 

Can apply scientific fact and theory to practical situations 

Creates confidence in those with whom he deals

Explains the reasons for his recommendations 

In another form of the forced-choice item, within each set of four
statements two are favorable and two are unfavorable. Of the two
favorable statements, one has greater validity than the other; and the same
is true, in a negative sense, of the unfavorable statements. The rater is
required to select the one attribute that is most applicable and the one
that is least applicable to the subject.

* The forced-choice method is also used in some self-rating scales, such
as the Kuder Preference Record and the Edwards Personal Preference
Schedule. These are discussed in Chapter 23.

Forced-choice rating scales yield, in general, more valid results than 
do graphic scales; but of the two forms of forced choice, better results 
are obtained with items that present four favorable attributes, two being 
relevant to the criterion, while the other two are significantly less rele- 
vant. This is attributable, at least in part, to the fact that respondents 
prefer the all-favorable type of item, hence cooperate better, and to the 
fact that faking and distorting are easier on the two-favorable and two- 
unfavorable type. 

The advantages claimed for the forced-choice form are these: Descrip- 
tions of behavior and assignment of attributes are kept within a narrower 
range of standards; generosity in ratings and the “halo” effect do not 
operate as strongly as with other methods; and rater bias, in either 
direction, is reduced, since it is difficult for respondents in general to 
know which of the attributes are the more favorable or the more un- 
favorable. 

This method, however, has its disadvantages. Respondents are often 
resistant or antagonistic to forced-choice items because of the restrictions 
imposed upon them and because they believe none of the statements 
within some sets are really appropriate to the person being rated. Con- 
scientious raters generally prefer the independence and the opportunity 
to exercise judgment, as provided by graphic scales, numerical scales, and 
percentage ratings (percentile, decile, etc.). 

THE Q-SORT TECHNIQUE. This method may be used for a variety of
purposes when a fairly wide range of relative rankings is wanted. The
rater might be given, for example, one hundred pictures of works of art
to sort into a number of groups [...] according to the degree to which
they are preferred. Or he might be given a large number of human traits
to sort out according to the degree to which he [...]. When the method is
used in rating an individual, the respondent is given a list of behaviors
or attributes, each on a separate card, which are to be sorted into a
specified number of piles ranging from those least descriptive of the
subject to those most descriptive. The rater is required not only to sort
these into the piles [...] but also to place a specified number in each,
so that a prescribed frequency distribution is obtained. The pile in which
a statement is placed indicates the relative strength, within the subject,
of the quality it represents.

Obviously, this forced distribution is a variant on the forced-choice
procedure. It can have little value, even with only seven groupings, unless
the number of items or statements to be sorted is rather large, in order
to give the distribution significance.
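
The bookkeeping of a forced distribution can be sketched as follows. The function name `score_q_sort`, the statement labels, and the quotas are hypothetical and serve only to show how a sort is checked against the prescribed numbers and how each statement receives the number of its pile as a score.

```python
def score_q_sort(piles, quotas):
    """Check a completed sort against the quotas and score each statement.

    piles  -- list of piles, ordered from least to most descriptive; each
              pile is a list of statement labels
    quotas -- the prescribed number of statements for each pile
    Returns a dict mapping each statement to its pile number (1 = least
    descriptive of the subject).
    """
    if len(piles) != len(quotas):
        raise ValueError("the number of piles must match the quotas")
    scores = {}
    for number, (pile, quota) in enumerate(zip(piles, quotas), start=1):
        if len(pile) != quota:
            raise ValueError(
                f"pile {number} holds {len(pile)} statements; {quota} required"
            )
        for statement in pile:
            if statement in scores:
                raise ValueError(f"statement sorted twice: {statement!r}")
            scores[statement] = number
    return scores


if __name__ == "__main__":
    # A toy sort of six statements into three piles with quotas 1, 4, 1.
    quotas = [1, 4, 1]
    piles = [["S3"], ["S1", "S2", "S5", "S6"], ["S4"]]
    print(score_q_sort(piles, quotas))
```

A realistic application would use many more statements and piles (seven, for example), with larger quotas in the middle piles than at the extremes.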

There are no fixed or standardized lists of behaviors and attributes; 
individual and particularized lists may be devised for each of many dif- 
ferent purposes or situations. Since, in this chapter, the discussion is 
concerned with ratings of individuals by others, the following statements 
are given as illustrations of items that might be used in making a
Q-Sort (4).*

Has a wide range of interests 

Behaves in an assertive fashion 

Is introspective and concerned with self as an object; frequently self-aware 

Has insight into his own motives and behavior. 


These statements are not different from those included in check lists 
or other forms of rating. They are different only in the manner in which 
they are treated. 


The justifications offered for the forced distribution are these: 

1. It eliminates individual differences in the patterns of response among
raters; it is a means of making ratings more nearly objective and
comparable by prescribing a pattern.

2. It facilitates the computation of a median or mean score (position) on
a set of attributes considered desirable or undesirable in a specified situation.

3. It provides a convenient method for calculating reliabilities of ratings
and validities of the items.

4. The method can yield a broad description of an individual if the
statements are carefully chosen and are parts of a meaningful whole.

Adverse criticisms are these:

1. Users of the technique assume without proof that a normal distribution
best fits the ratings of each person, whereas while statements may have
relative applicability and significance for a person, it does not necessarily
follow that an arbitrary distribution is the correct or the best fit.

2. The Q-Sort technique provides an estimate of the relative rating for 
each attribute; it does not give any information about the strength of each 
attribute within an individual, since each person's own typical behavior is 
used as the standard of comparison in each instance. 

3. The method requires considerable conscientious effort on the part of 
the rater and is, therefore, not feasible in most practical situations. 

4. That the Q-Sort technique yields results superior to those obtained by 
other rating methods has not been established. It does appear, however, that 
it is a useful tool in research on groups. 

* This method also may be used to obtain self-ratings. See (19).

CRITICAL INCIDENT TECHNIQUE. This is a technique that uses detailed
descriptions of an individual's behavior, regarded as favorable (effective)
or unfavorable (ineffective) in a given situation. The purpose of the
method is to find the traits and behaviors that contribute
to successful or unsuccessful performance in a certain type of
undertaking, and, upon the basis of the findings, to prepare a list of
favorable and unfavorable attributes to be used in subsequent ratings.
Flanagan states: "By an incident is meant any observable human activity
which is sufficiently complete in itself to permit inferences and
predictions to be made about the person performing the act. To be critical,
an incident must occur in a situation where the purpose or intent of the
act seems fairly clear to the observer, and its consequences are sufficiently
definite so that there is little doubt concerning its effects" (8, p. 2).
"Essentially, the procedure was to obtain first-hand reports, or reports
from objective records, of satisfactory and unsatisfactory execution of the
task assigned. The cooperating individual described a situation in which
success or failure was determined by specific reported causes" (8, p. 4).


Since this technique originated in the armed forces of the United 
States, the following example is cited (8, p. 8). 


This officer (a Capt.) was in charge of a Base Service Section and had one
of the hardest jobs on the base. One of the many activities under his section
was the Officer's Club, always a headache, and especially when the club
officer was no good. This particular club was losing money, service was
terrible and the place continually looked dirty and unkempt. In addition
the chef and the civilian manager (a woman) feuded from dawn to dusk
and the club officer was afraid of the club manager. The Base Service Officer
realized that he must take action immediately so he went over to the club
and called each of the key people in separately and explained in detail
what their duties were and what he expected them to do and how. Then
he called them all together and gave them a lecture on tact, cooperation
and service. This took care of the personnel angle, so he spent the next
three weeks (after his normal duty hours) reorganizing the club's stock control
system, purchasing system and many other small club activities. At the end
of that three-week period, the club had improved so greatly that the amount
of business had nearly doubled.
Following an exploratory study of the use of this technique for
application to medical education, a rather lengthy tentative list of
attributes (a "first approximation") was suggested. One section of this
list follows (5).

E. Communication with patients
1. Taking time to talk over a problem until full understanding is reached
2. Informing patient of operative findings
3. Informing patient that he has cancer
4. Honesty in telling patient what to expect
5. Failure to give clear directions
6. [...] understanding of financial obligations before a procedure is
   initiated

The value of the critical incidents depends upon the competence and
insights of observers who report them and upon obtaining a series of
reports that will permit the isolation of a comprehensive list of valid
attributes.

Representative Rating Scales. Several scales, representative of 
the more satisfactory ones and serving different purposes, will be de- 
scribed. 

Haggerty-Olson-Wickman Rating Schedules. These are designed for 
the detection and study of behavior problems and problem tendencies in 
individuals from nursery school through high school. Schedule A is a 
behavior-problem record, enumerating fifteen types or sources of prob- 
lems, such as speech difficulties and defiance of discipline. Each of the 
fifteen is rated from 1 to 4, depending upon frequency of occurrence. 
Schedule B is a graphic scale, consisting of thirty-five traits classified into 
four groups: intellectual, physical, emotional, and social. These traits are
scored on a five-point scale. The authors provide better than usual
evidence of validity. They report a correlation of .76 with frequency of 
referral for reasons of discipline or other action by school principals, 
which may be symptomatic of a variety of adjustment difficulties. They 
report, also, that only 10 percent of "normal" children reach or exceed 
the median score of those referred to the psychological clinic. 

The Vineland Social Maturity Scale. This scale is unique in having 
been constructed and standardized on the model of the Stanford-Binet 
Scale. It is designed for use with individuals from infancy to the age of 
30 years. 

Unlike many other scales, this one is based upon a well-defined ra- 
tionale and has been systematically constructed. Behavior items are 
grouped at age levels, as in the Stanford-Binet. The items represent
progressive maturation and adjustment to the environment in the following
categories: self-help, self-direction, locomotion, occupation,
communication, and socialization. The following examples illustrate the
several categories.

Self-help:       Reaches for nearby objects (age 0-1)
Self-direction:  Buys own clothing (age 15-18)
Locomotion:      Walks about room unattended (age 1-2)
Occupation:      Helps at little household tasks (age 3-4)
                 Systematizes own work (age 25+)
Communication:   Makes telephone calls (age 10-11)
Socialization:   Demands personal attention (age 0-1)
                 Advances general welfare (age 25+)


Items are scored after interviewing someone well acquainted with the
subject, or the subject himself. A social age is then obtained; this is
divided by chronological age, yielding a social quotient (SQ).
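
The arithmetic may be sketched as below. The text states only that social age is divided by chronological age; multiplying the ratio by 100, by analogy with the ratio IQ, is an assumption made here, as are the function name and the expression of both ages in years.

```python
def social_quotient(social_age, chronological_age):
    """Return the social quotient from two ages given in the same unit.

    The factor of 100 is an assumed convention (analogous to the ratio IQ);
    the source describes only the division of social age by chronological age.
    """
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    if social_age < 0:
        raise ValueError("social age cannot be negative")
    return 100.0 * social_age / chronological_age


if __name__ == "__main__":
    # A hypothetical child of 10 years with a social age of 8 years.
    print(social_quotient(8.0, 10.0))
```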

Although this social maturity scale shows a high correlation with
intelligence-test results (about .80), Doll maintains that it is distinct enough
in content and in the behaviors rated to warrant its use in the study of
an individual's general development, since social age provides a basis
upon which to proceed in his care and training.

Although the scale is intended for use with a normal population as 
well as with the mentally deficient, it was first conceived as an aid in the 
diagnosis of feeblemindedness. In the first instance, it was and still is in- 
tended to differentiate between mentally deficient individuals who are 
also socially inadequate, on the one hand, and, on the other hand, the 
mentally retarded who are competent to conduct their personal and 
social lives. 

The Vineland scale has had wide use in clinics for children and adoles- 
cents because, in addition to the uses already indicated, it is a valuable 
device for interviewing and counseling both parents and children. 

The Fels Parent Behavior Scales. This device provides thirty rating
scales, in as many aspects, for assessing parental behavior toward their
[...] by means of home visits and interviews [...] thirty aspects of
child-parent relations [...] in home, sociability of family,
child-centeredness of family, restrictiveness of regulations, readiness of
criticism, [...]

[...]tive characterizations of behavior. To be of most use to a counselor or
a clinician, each trait rating should be accompanied by a recorded
description of the pupil's behavior in that category.

Wittenborn Psychiatric Rating Scales. This is a highly specialized
device, being "... a procedure for recording the observed behavior of
mental patients and for describing them according to their current
symptoms. The Scales provide for the assignment of numerical values to
indicate the presence and degree of pathological symptoms in a patient"
(16, Manual). Fifty-two symptoms (pathological manifestations) are rated;
each, with a few exceptions, is given a weighted score from 0 to 3; the
scores are then totaled for each of nine psychiatric categories (see Fig.
21.1, in which five of the fifty-two symptoms are rated). Since the
symptoms sample behavior "ordinarily considered important by psychiatrists,"
this instrument is based, primarily, upon content validity.

WITTENBORN PSYCHIATRIC RATING SCALES
J. Richard Wittenborn
Copyright 1955, The Psychological Corporation, 522 Fifth Ave., New York 36, N.Y. Copyright in Canada.

Patient's Name ____  Age ____  Sex ____
Date Observations Began ____  Date of Rating ____
Institution ____  Ward ____  Ratings Made by ____

Instructions to Rater:

1. Print the requested information in the spaces above.

2. Each scale consists of three or four descriptive statements. For each
scale, select the one statement which best describes the patient's behavior
and draw a circle around the number found at the right of that statement.

Gives no evidence of difficulty in sleeping.
Without sedation may have difficulty in falling asleep, or sleep is
readily or spontaneously interrupted.
Without sedation long periods of wakefulness at night.
Acute insomnia; without sedatives gets less than 4 hours sleep in 24.

Rate of change of ideas (e.g., topics of conversation) does not appear
to be accelerated, nor are changes conspicuously abrupt.
Ideas may change abruptly.
Ideas are in the process of rapid and constant change.
Ideas change with spontaneous and unpredictable rapidity as to
make sustained conversation impossible.

No evidence that he imagines people (who probably are wholly
indifferent to him) have an amorous interest in him.
Believes (without justification) that certain persons have an
amorous interest in him.
Believes (without justification) that a sexual union has occurred
or has been formally arranged for him.

No evidence for obsessional (repetitive, stereotyped) thinking.
Obsessive thoughts recur but can be banished without difficulty.
Patient is able to banish obsessive thoughts but only with
difficulty.
Cannot banish or control obsessive thoughts.

Patient appears to have a delusional belief that he is an
extraordinarily evil, unworthy or guilty person.

FIG. 21.1. Items from the Wittenborn Psychiatric Rating Scales. The white spaces
indicate to which of the nine psychiatric categories the item is relevant and for
which it is scored. By permission.

The nine categories are: acute anxiety, conversion hysteria, manic 
state, depressed state, schizophrenic excitement, paranoid condition, 
paranoid schizophrenic, hebephrenic schizophrenic, and phobic compul- 
sive. The scores for these categories may be viewed as a profile, for the 
purpose of making a clinical diagnosis. Significant clues are provided by 
high scores;* in general, the fifty-two items assist in and focalize objective
descriptions of patients' observable behavior. From the categories listed, 
it is quite evident that this scale should be used and interpreted only by 
professionally qualified persons. 
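
The totaling of item scores into a profile can be sketched as follows. The nine category names are those listed above; the item labels, the assignment of items to categories, and the function name `category_profile` are hypothetical and do not reproduce the scoring key of the published Scales.

```python
CATEGORIES = (
    "acute anxiety", "conversion hysteria", "manic state",
    "depressed state", "schizophrenic excitement", "paranoid condition",
    "paranoid schizophrenic", "hebephrenic schizophrenic",
    "phobic compulsive",
)


def category_profile(ratings, item_categories):
    """Total the item scores for each of the nine categories.

    ratings         -- dict mapping an item label to its score (0 to 3)
    item_categories -- dict mapping an item label to the categories for
                       which that item is scored (a hypothetical key)
    """
    totals = {category: 0 for category in CATEGORIES}
    for item, score in ratings.items():
        if not 0 <= score <= 3:
            raise ValueError(f"score for {item!r} must lie between 0 and 3")
        for category in item_categories[item]:
            totals[category] += score
    return totals


if __name__ == "__main__":
    # Hypothetical key and ratings, for illustration only.
    key = {
        "insomnia": ["acute anxiety", "depressed state"],
        "rapid change of ideas": ["manic state"],
    }
    profile = category_profile({"insomnia": 2, "rapid change of ideas": 3}, key)
    for category, total in profile.items():
        print(f"{category:28s}{total}")
```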


* This fact, the student will recognize, applies to various types of
psychological testing.

Other Scales. The reader will have observed that the scales described
above are intended for school and clinical use. The reason for this
selection is that the better devices of this type are to be found in schools
and clinics. Numerous rating scales [...] have not been subjected to as
thorough scrutiny as they should be.

The following items, used in industry, are presented for illustrative
purposes. They are typical of those that are widely used.

Relations with other supervisors:
  a. Often not satisfactory
  b. Sometimes not satisfactory
  c. Usually gets along well
  d. More satisfactory than average
  e. Exceptionally satisfactory
Knowledge of the characteristics and abilities of subordinates:
  a. Knowledge markedly limited
  b. Knowledge somewhat limited
  c. Knows employees fairly well
  d. Knowledge better than average
  e. Knowledge exceptional
Willingness to make difficult decisions:
  a. Often "passes the buck"
  b. Inclined to "pass the buck"
  c. Usually properly willing
  d. More willing than average
  e. Exceptionally willing
Resourcefulness in meeting difficulties:
  a. Often goes to pieces
  b. Easily discouraged by obstacles
  c. Usually meets the situation
  d. More resourceful than average
  e. Nearly always finds good way out
Ability to learn new work:
  a. Learns with difficulty
  b. Learns somewhat slowly
  c. Learns fairly easily
  d. Better than average
  e. Learns with exceptional ease and speed


In the armed forces during World War II, rating scales were devised 
to assist in personnel evaluation in a variety of situations. These followed 
the usual principles and practices already explained.* 

Evaluation of Rating Scales. Rating scales are not tests; nor 
are they precise or objective measures; hence, their reliability cannot 
be so high as that found for other types of psychological instruments. 
Rating scales do provide a means of obtaining organized descriptions 
of behavioral traits from judges who have had ample opportunity to 
make the necessary observations. If the scale meets the specifications 
discussed in this chapter, ratings will be based upon greater uniformity 
of trait-definition and trait-connotation than will purely individual
ratings of traits defined by each judge independently.

By means of an organized scale it is possible to obtain ratings on
specified traits that are considered essential or significant in the
particular setting where the scale is being used. Completely independent,
or "unstructured" ratings, by contrast, may fail to provide desired
information.

* See Stuit (14).

While high reliability of ratings among judges of the same persons
would simplify interpretations of results, moderate or even low
reliability coefficients do not discredit a soundly conceived scale. The
reason for relatively low reliability coefficients may be one or a
combination of the following two factors. (1) In spite of carefully defined
traits and degrees thereof, ratings are always subject to the judges' biases,
values, and standards of performance and behavior. Some degree of influence
of these aspects of the judges' own personalities is inevitable. (2) Behavior
of the person being rated may be and at times is variable in different
types of situations. This is the case even though there is some degree of
trait-consistency within a person. For example, an individual who is
"self-confident" in one situation is prone to be "self-confident" in
another, whereas a person who is "withdrawn" is prone to behave in that
way in many settings. Yet that same individual is not equally
self-confident or equally withdrawn on all occasions; and, in fact, there
may be situations in which his characteristic mode of behavior is not
readily apparent. Variability of behavior is especially true of children
and adolescents whose personality traits are still in process of formation.
In interpreting ratings, it is necessary, therefore, to know the types of
situations in which each judge made his observations.
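
One common way of expressing agreement between two judges is the product-moment correlation between their ratings of the same persons. The sketch below assumes that coefficient; the text does not specify which reliability coefficient is meant, and the ratings shown are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two judges' ratings.

    x, y -- ratings given by two judges to the same persons, in the same order
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("two equally long lists of at least two ratings needed")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a judge who gives everyone the same rating "
                         "yields no correlation")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Invented ratings of five persons on a five-point scale.
    judge_a = [1, 2, 3, 4, 5]
    judge_b = [2, 1, 4, 3, 5]
    print(round(pearson_r(judge_a, judge_b), 2))
```

A moderate coefficient obtained in this way says nothing by itself about which of the two factors named above is responsible; that requires knowing the situations in which each judge observed.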
The usual criteria and standards of validity are not applicable to rating 
scales. Theirs is a matter of construct, or content, validity. The questions 
to be asked regarding the validity of a rating scale are these: Does it 
meet the specifications of a sound scale? Are the traits being rated by 
the scale significant in the setting or occupation for which the individual 
is being considered? If these two questions are answered satisfactorily, 
then the ultimate usefulness (that is, predictive validity) of the scale 
will depend upon the soundness (reliability) of the judges' ratings.

SITUATIONAL TESTS


Among the more recent developments in psychological testing 
are situational tests that either test the individual in action or confront 
him with situations related to his own life, in response to which he gives 
expression to his feelings for other persons. The individual's behavior
is rated or evaluated by his peers or by judges.

Although situational tests are not as unstructured as the Rorschach 
and the Thematic Apperception Test, they are in a degree projective 
methods;* for the subject, by means of them, reveals some of his
personality traits through his preference for or against certain contacts with
others (as in sociometric tests), and through his spontaneous methods of 
dealing with life situations (preconceived by the examiner) that confront 
him (as in the psychodrama and in Office of Strategic Services tests). 


Sociometric Methods 


Description. This method, credited to J. L. Moreno as the 
innovator (15), may be defined as a technique for revealing and evaluating 
the social structure of a group through the measurement of the frequency 
of acceptance or nonacceptance among the individuals who constitute 
the group. It is an approach to the problem of studying interpersonal
relationships. This technique permits the analysis of each person's
position and status within the group, with respect to a particular
criterion. (For example: Name the pupil in your class with whom you would
most like to sit at lunch; name your second choice. Name the two persons
in your class, in order of preference, whom you would choose as leader
on a trip.) The method also reveals the organization of the group, as
well as identifying dominant individuals, cliques, cleavages (sex, racial,
economic, etc.), and patterns of social attraction and rejection. The
reasons for the existing patterns of attraction and avoidance can then
be determined if the personality traits of each individual are known and
the values of the group as a whole, established.

* For projective tests, see Chapters 25 and 26.

The method is a very simple one. The sociometric test requires that 
each individual in a given group choose one or more other persons in 
that group for a specified purpose. In a schoolroom, the pupils may be 
asked to name their first and second preferences next to whom they wish 
to sit, or with whom they wish to attend the movies, or with whom they 
would like to work on a project. Or they may be asked to name one or 
more individuals in the group who possess certain specified traits, such 
as the opposites "talkative-silent," "neat-unkempt," etc. Sociometric tests
were used in a state training school for girls to determine with whom 
each individual would prefer to live or work, and with whom each would 
not want to live or work (9). The method was adopted for use in the 
armed forces in an effort to identify individuals for specific assignments 
requiring, for example, leadership and dependability. Thus, each in- 
dividual is viewed in his social relationship to the whole group. 

It is apparent that a sociometric test may be devised for innumerable 
groups and situations. The guiding principles are that each one must 
be relevant to a life situation of the group, and the items or questions 
must be such as to require each person in the group to make one or 
more definite selections revealing certain personal preferences, rejections, 
or values.

As an illustration, we take a class of seventeen pupils—seven girls and 
ten boys—in a school grade. They are asked to name the two pupils with 
whom they would prefer to sit at lunch. After the information is ob- 
tained, a sociogram is constructed. Of the several kinds of sociograms 
that have been suggested, the one shown in Figure 22.1 commends itself 
because it is simply made and easy to interpret. It is known as the
"target technique," having been described by Northway (20). There are
four concentric circles; acceptability scores, based on total number of
choices received by each person, are divided into four groups; the lowest
quarter is on the outside of the target and the highest is in the center.
Each individual is represented on the target according to his acceptability
score. The arrows show which individuals have been selected by whom.
Solid and broken lines indicate first and second choices, respectively. It
is possible, also, to divide the target in various ways in order to show
cleavages. In this figure, the vertical line readily shows the intersex
choices.

Usually, an individual's sociometric score is simply the number of
mentions he receives, or the percentage of mentions he receives from
others in the group.
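
The scoring just described can be illustrated with a short sketch. It is not taken from the text: the function names, the pupils' names, the data layout (a mapping from each chooser to the persons chosen), and the rule of cutting the ranked group into four equal parts, with ties broken alphabetically, are all assumptions made for illustration.

```python
from collections import Counter


def sociometric_scores(choices):
    """Return {person: (mentions, percentage)} for a sociometric test.

    `choices` maps each chooser to the list of persons he or she chose.
    The percentage is taken over the other members of the group
    (an assumption; the text does not fix the denominator).
    """
    members = set(choices)
    for chosen in choices.values():
        members.update(chosen)
    received = Counter()
    for chooser, chosen in choices.items():
        for person in set(chosen):
            if person != chooser:  # a self-choice is ignored
                received[person] += 1
    others = max(len(members) - 1, 1)
    return {m: (received[m], 100.0 * received[m] / others)
            for m in sorted(members)}


def target_rings(scores, rings=4):
    """Place each person in ring 1 (center, most chosen) to ring 4 (outside).

    The ranked group is cut into `rings` parts of nearly equal size;
    ties are broken alphabetically, which is an arbitrary choice.
    """
    ranked = sorted(scores, key=lambda m: (-scores[m][0], m))
    n = len(ranked)
    return {m: 1 + (i * rings) // n for i, m in enumerate(ranked)}


# Hypothetical lunch-table choices for a group of four pupils.
example_choices = {
    "Ann": ["Bea", "Cal"],
    "Bea": ["Ann", "Cal"],
    "Cal": ["Ann", "Bea"],
    "Dan": ["Ann", "Cal"],
}
example_scores = sociometric_scores(example_choices)
example_rings = target_rings(example_scores)
```

In this made-up group, Dan receives no mentions and therefore falls in the outermost ring, while Ann, mentioned by all three of the others, falls in the center.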


Fig. 22.1. Sociogram of an elementary school class


. . . names of others in the group . . . the "word picture." One . . .
description suits him. . . . each trait. For example: "H . . . in class;
he (or she) moves around in his (or her) . . . walks around." "Here is
someone who can work quietly without moving around in his (or her) seat."

In the research reported by Tryon (23), twenty pairs of items were
used, describing the extremes of twenty traits, as follows:

restless—quiet

talkative—silent 

attention-getting—non-attention-getting 

bossy—submissive 

unkempt—tidy 

fights—avoids fights 

daring—afraid 

leader—follower 

active in games—sedentary 

humor (regarding self)—humorless (regarding self) 

friendly—unfriendly 

popular—unpopular 

good-looking—not good-looking 

enthusiastic—listless 

happy—unhappy 

humor (jokes)—humorless (jokes) 

assured (with adults)—shy (with adults) 

assured (in class)—embarrassed (in class)

grown-up—childish

older friends (preference for)—younger friends (preference for) 

An individual's score on a given trait is determined by the number
of times he is mentioned by his classmates on the pair of opposed items.
The item in each pair designating activity or a favorable trait is given
a positive score, whereas the opposed item is given a negative score.
An individual's score on each trait is the algebraic sum of positive and
negative mentions received. In order to equate for the size of class,
the algebraic sum of mentions received by each child on each trait is
converted into a proportion of the class voting for him. (Self-mentions
are not included.)
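
The arithmetic of this trait score can be sketched as follows. The function name and the figures are hypothetical, and the denominator (class size less one, since self-mentions are excluded) is an assumption about what "proportion of the class voting" means.

```python
def trait_score(positive_mentions, negative_mentions, class_size):
    """Algebraic sum of mentions, expressed as a proportion of the voters.

    positive_mentions: mentions on the active or favorable item of the pair
    negative_mentions: mentions on the opposed item
    class_size: number of pupils in the class, including the one scored
    """
    voters = class_size - 1  # assumed: the pupil does not vote for himself
    if voters <= 0:
        raise ValueError("class_size must be at least 2")
    return (positive_mentions - negative_mentions) / voters


# Hypothetical pupil in a class of 31: named "restless" by 12 classmates
# and "quiet" by 3. "Restless" designates activity, so it counts as positive.
example_trait_score = trait_score(12, 3, 31)
```

Dividing by the number of voters is what makes scores from classes of different sizes comparable.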

A recent instrument is the Syracuse Scales of Social Relations (6),
separate forms of which are available for elementary (grades 5 and 6),
junior high school (grades 7-9), and senior high school (grades 10-12)
pupils. The scales are based upon . . . specific psychological needs at
each of the three levels:*

elementary: succorance; achievement-recognition
junior high school: succorance; deference
senior high school: succorance; playmirth

* These needs, the authors of the scales indicate, were based upon the
research of H. A. Murray and his colleagues (18).
These particular needs were selected ". . . because they provide the
most vital kind of information . . ."
The procedure is the same at all three levels. Each individual ". . . is 


asked to rate his classmates as possible sources of aid when he is troubled 
by some personal problem" (succorance). Under "need-achievement," each 
pupil ". . . rates each of his classmates as a possible source of support 
in his effort to attain personal goals whose attainment will bring social 
approval and commendation." Under "deference," fellow pupils are 
rated as ideals to look up to; while under "playmirth," they are rated 
according to the extent to which their company would be enjoyed at 
a party or other forms of recreation. The evaluations by each pupil 
are made ". . . with reference to a scale of ‘all persons he has ever known’
which permits inter-individual and inter-group comparisons of need- 
satisfaction expectations” (see Fig. 22.2). 

The results obtained with these scales are expected to indicate the 
extent to which each individual feels favorable toward his classmates. 
The scores are intended to provide answers—at least tentative ones—to 


such questions as these: Does a particular pupil feel comfortable with 
his classmates? Does he feel he is part of the group and . . . are the
social relations within the class? Answers to these questions, it is
held, should provide a basis . . . with any remedial measures that . . .


. . . ratings and reratings over short intervals of several weeks, are
highly reliable, yielding coefficients of about .90. . . . each subject,
two sets of scores by . . . number of votes received. However, . . .
regarding individuals who clearly deviated from the average category. The
foregoing evidence indicates that, among the subjects tested, there is
. . . concurrence of opinion regarding one another. Test-retest
reliability coefficients obtained with the Syracuse scales varied from .61
to .85, thirteen of the sixteen coefficients being above .70.
The usual criteria and standards of validity do not apply to sociometric
tests; for they do not set out to determine what are some of the actual 
personality and behavior traits of the individuals being rated. They are, 
rather, measures of the environment of opinion in which each individual 
is functioning. When children and adolescents express preferences for, or 
rejection of classmates, or when they mention classmates in connection 
with specific traits, they are not necessarily giving their own independent 
judgments. As a member of the group, each individual acquires, in some 
degree, the prevailing group attitudes toward his fellows. And as the 
object of these attitudes, each individual interacts in some manner with 
these opinions and with the persons holding them. Since the chief purpose
. . . content validity and, in some instances, construct validity. The
Syracuse scales provide some data, also, of concurrent validity; that is,
correlations between findings with . . . of esprit de corps and of "group
effectiveness" . . . university (5, 6). For the first of these, the mean
coefficient is .74; for the second it is .92.

Uses. In analyzing a group structure by sociometric means, Moreno
and others have used the number of "isolates," "mutuals," "unreciprocated
choices," and "stars" as indexes of . . . , interrace, internation,
interoccupational-level choices may be used as evidence of cleavages.

Sociometric technique has been applied to the study of a variety of
social situations in classrooms, . . . residential communities. These
investigations . . .
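
The four indexes named above can be counted directly from the choice data. The sketch below is only an illustration under stated assumptions: the text does not define the terms, so an "isolate" is taken here to be a person chosen by no one, a "mutual" a pair who choose each other, and a "star" a person receiving at least `star_threshold` choices. The threshold, the function name, and the example names are hypothetical.

```python
def group_structure(choices, star_threshold=3):
    """Count isolates, mutuals, unreciprocated choices, and stars.

    `choices` maps each chooser to the list of persons chosen.
    All four definitions used here are assumptions (see the lead-in).
    """
    members = set(choices)
    for chosen in choices.values():
        members.update(chosen)
    # Directed pairs (chooser, chosen), ignoring self-choices.
    pairs = {(a, b) for a, chosen in choices.items() for b in chosen if a != b}
    received = {m: 0 for m in members}
    for _, chosen in pairs:
        received[chosen] += 1
    mutuals = sorted({tuple(sorted(p)) for p in pairs if (p[1], p[0]) in pairs})
    unreciprocated = sorted(p for p in pairs if (p[1], p[0]) not in pairs)
    isolates = sorted(m for m in members if received[m] == 0)
    stars = sorted(m for m in members if received[m] >= star_threshold)
    return {
        "isolates": isolates,
        "mutuals": mutuals,
        "unreciprocated": unreciprocated,
        "stars": stars,
    }


# Hypothetical choices for a group of four.
example_structure = group_structure({
    "Ann": ["Bea", "Cal"],
    "Bea": ["Ann", "Cal"],
    "Cal": ["Ann", "Bea"],
    "Dan": ["Ann", "Cal"],
})
```

Here Dan is the only isolate, his two choices are the only unreciprocated ones, and Ann and Cal, each chosen three times, are the stars.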
The central principle of the psychodrama is spontaneity, which has
been defined by Moreno as the ability of the subject to meet each new 
situation with adequacy, as “the most important vitalizer of living struc- 
ture." In contrast to the act of spontaneity stands the "cultural con- 
serve," which is the creative idea that has become preserved and static, 
and hence repeated and stereotyped. What the psychodrama aims at, 
therefore, is to develop in the subject the capacity to play his life roles 
in a spontaneous and always creative manner that will enable him to 
meet adequately the demands of new and evolving situations, rather than 
by employing stereotyped patterns of response. 

The psychodrama involves a director, who is the therapist or the one 
studying the personalities in the situation. On the basis of knowledge of 
his subjects and their problems, he creates the situation, selects the actors, 
assigns roles, observes and interprets the action, and acts as the link 
between actors and audience. Emphatic and active participation by the 
audience is an essential of this technique; for its members are individ- 
uals who are, or will be, in situations similar to those being portrayed 
in the act. The individual who is the subject in the drama (or the pa- 
tient, in therapy) is called the primary ego. He is the one being assisted 
in the solution of a problem of adjustment, or in learning to live a certain 
role in life. The auxiliary ego is another actor in the drama; he is the 
agent who provides the assistance needed by the primary ego. The aux- 
iliary ego does so either by (1) acting as the primary ego, identifying with 
him and representing him toward others or (2) by acting in the role 
of, and representing, another person with whom the primary ego is
involved.

The foregoing are the basic procedures. A number of modifications 
and variations of the technique have been evolved. Where the two 
persons involved in a conflict (for example, parent and child, husband
and wife, employer and employee) are placed in a psychodramatic
situation, each may be instructed to act out his own role, or each may be
required to act the role of the other person, as he perceives that other
one in the specified situation.

Briefly stated, the psychological rationale of the psychodrama is the 
following. In therapy, the subject, by acting, by participating in the 
reproduction of a life situation significant to him, experiences an
emotional catharsis.* In the process, while he gains insight into his own
behavior, he should learn how to meet a situation adequately
(spontaneously and creatively) through observations of himself and through
interpretations and evaluations given by the therapist (or director) and
members of the audience. The psychodrama, thus, is intended to be a
learning procedure that will teach the subject how to meet each life
situation adequately.

* Adherents of the psychodramatic technique maintain that the catharsis derived from
it is much more genuine than catharsis derived from the method of psychoanalysis in
which the subject verbalizes in a remote situation (the therapist's office), dissociated
from other persons involved in the conflict.

In personality evaluation by means of psychodrama, the director and 
others observe and analyze each subject's characteristic ways of dealing 
with a situation commonly encountered. A wide range of themes may be 
used for testing behavior, depending upon the nature of the persons 
involved: economic problems, family relationships, social status, school 
status, play activities, levels of aspiration and self-realization, etc. The
number and kinds of themes are as varied as life itself. 

A variant of the psychodrama is called the sociodrama. The latter 
differs from the former in respect to purpose and emphasis. Whereas the 
psychodrama deals with interpersonal relations and adjustment prob- 
lems within the individual, the sociodrama is concerned with group 
values and group structure and thinking. The sociodrama portrays social 
phenomena and conflicts with which the audience is concerned and to 
which a solution is being sought. 

Although Moreno’s book on psychodrama was published in 1946, there 
is as yet little sound empirical evidence to establish its value as a method 
of diagnosis and treatment suitable for wide use. The method has yet to 
achieve a satisfactory level of objectivity and systematic organization in 
respect to techniques of observation, rating of behavior, and
interpretation of responses. Furthermore, the validity of the hypotheses regarding
the value of the psychodrama as a technique to develop spontaneous, 
adequate, and adjusted personalities has not been demonstrated. A seri- 
ous obstacle to the experimental development and use of the psycho- 
dramatic group technique is its heavy requirement of time, personnel, 
and equipment. Perhaps this accounts for the paucity of research data. 


Office of Strategic Services: Assessment Tests 


. . . and Procedure. During World War II, a group of psychologists and
psychiatrists were given the assignment of assessing the men and women
recruited for the OSS, as it has come to be known. The task was to devise
test procedures that would reveal their personalities and give reliable
predictions of their future success in this branch of military service. At
the main station, the testing period lasted three days; at another
station it lasted one day.

The . . . types of jobs in this organization were large and . . . included
script writer, base-station operator, demolitions . . . , representative,
section leader, resistance-group leader,
saboteur, undercover agent, liaison pilot, pigeoneer, and others. Very
little indeed was known regarding the qualifications essential for suc- 
cessful performance in these jobs. Nor was there time and opportunity 
to make job analyses, as would be done in the case of an ordinary civilian 
situation when aptitude and personality tests for a specific occupation 
are to be devised. The assessment staff decided, therefore, to use the 
"wholistic" approach: that is, to evaluate each personality as a whole. 
This meant that some members of the staff would provide an over-all
evaluation and description of each individual, based upon interview; 
each candidate would be tested, observed, and evaluated in respect to 
specific traits of personality, intellect, and physique. Finally, all informa- 
tion for each individual would be assembled, organized, and interrelated 
to provide a complete description and evaluation of each candidate. On 
the basis of their unified conception of each individual's personality traits, 
the staff estimated the probable level of future performance. For each
recruit, then, an assignment was determined upon, using as criteria the 
statements of the qualifications required for each job as formulated by 
each branch of the OSS. 
Among the variety of devices used in the assessment of the candidates
were situational tests, similar in conception to those used in the
sociodrama. A few are listed below:

Upon arriving at the testing station, each candidate was judged according 
to the ease with which he used the fictitious name under which he went
(candidates did not know each other's real names, ranks, or civilian status);
physical agility in getting off the truck.

The first day, during the welcoming talk, each candidate's attitudes,
postures, questions, and comments were noted.

During the first meal, each recruit's conversation was noted (topics
revealing identity were prohibited), as was ease of establishing contact
with others.

Various other observations were made during the first evening, in the
free periods, when the situations were relatively unstructured.

Terrain test: Candidates were told that at noon of the following day
they would be tested for ability to observe the terrain of the station and its
buildings, and from their observations, to infer the history of the farm.

Crossing a brook: The task was for a group to carry a delicate instrument
over a "raging torrent" with "sheer banks" and to return with some
material from the other side. Available were a few boards, a log, a heavy
rock, lengths of rope, a pulley, and a barrel with both ends knocked out. All
members of the group were on an equal footing, no one having been
designated as leader. The actual setting was a shallow, quiet stream, about
eight feet wide.

Construction: Ostensibly a test of the candidate's ability to direct two
helpers in building with him a frame structure out of simple
materials. Actually it was a test of leadership, emotional stability, and
frustration tolerance, for the two helpers were junior staff members. The job
of one was to act passive, sluggish, and even as an obstacle. The job of the
other was to be aggressive, offer poor suggestions, express dissatisfaction and
criticism.

Stress interviews: This situation was devised to test the candidate's
capacity to tolerate severe emotional and intellectual strain. Strain was
created by rapid, confusing cross-questioning under disagreeable conditions
with the aim of detecting flaws in a covering-up story which the candidate
had been given twelve minutes to invent.

Poststress interview: Following the stress interview, each candidate, 
individually, was placed in a relaxed atmosphere, in the presence of a 
staff member. Casual conversation was started by the latter in an effort to 
get the candidate to "let down his guard" and thus reveal information 
about himself and his experiences at the station that should have been kept 
to himself. 

Improvisations: For each candidate appropriate dramatic situations were 
invented, to which he had to respond realistically in a psychodrama. These 
improvisations were used to test out and clarify "critical hunches" regarding
certain personality traits of each candidate. Leadership quality, moods,
attitudes, and modes of conducting interpersonal relationships were
evaluated by this method.

After interviewing branch chiefs to obtain their views regarding the
necessary traits for successful performance, and after organizing and
combining the listed traits, seven major variables emerged; and to
evaluate these in each candidate, the several kinds of tests were devised, of
which those described above are samples.

The seven general variables considered to be basic to the needs of the
OSS were:

1. motivation for assignment: war morale, interest in proposed job

2. energy and initiative: activity level, zest, effort, initiative

3. effective intelligence: practical and efficient utilization of intelligence
in dealing with things, people, and ideas

4. emotional stability: steadiness, endurance, control over disturbing
emotions, freedom from neurotic tendencies

5. social relations: good will, teamwork, freedom from disturbing
prejudices and annoying traits

6. leadership: initiative; ability to evoke cooperation, to organize,
administer, and accept responsibility

7. security: ability to keep secrets; caution; discretion; ability to bluff and
mislead

To these seven traits, three others were subsequently added for use
in selected jobs.

8. physical ability: agility, daring, ruggedness, stamina

9. observing and reporting: observation and accurate recall of significant 
facts and their relations, evaluation and succinct reporting of informa- 
tion 

10. propaganda skills: ability to perceive the psychological vulnerability of 
the enemy, to devise subversive techniques, and to speak, write, or draw 
persuasively 


Candidates were rated independently, by staff members, on a sixteen- 
point scale. Staff meetings were held to arrive at an “optimal characteri- 
zation and evaluation" of each candidate. The entire procedure was a 
combination of situational and other test techniques, rating scales, and 
case conferences. 

Evaluation of the OSS Tests. Since the task of the staff was to 
devise tests that would reveal personality traits for the purpose of predict- 
ing success in future assignments, it was necessary to appraise the fore- 
casting value of the procedures being used. Even in ordinary civilian 
situations, where subjects are under frequent or constant observation and 
where their effectiveness in performance can be judged in terms of
relatively concrete outcomes, assigning ratings presents serious difficulties. It
was to be expected, therefore, that evaluations of the performance of
persons accepted after OSS assessment would be even more difficult and less
reliable; for these men and women were not always under close
observation in the field; it was not always possible to rate their work, because
often the results were intangible and deferred; and, for the most part, the
primary judges on the job were inexperienced in making psychological
evaluations. The following results should be viewed with these
considerations in mind.

Four techniques of appraisal were used: (1) overseas-staff appraisal,
(2) theater-command appraisal, (3) reassignment-area appraisal, and (4)
returnee appraisal.* In all instances, by whatever method obtained,
appraisal information, rated on a numerical scale, was correlated with
overall assessment ratings and with specific trait ratings that had been
assigned after the initial testing period at the center. The validity
coefficients depend, among other things, upon the reliability of the ratings.
In the case of "returnee appraisal," the mean correlation between ratings
was approximately .35. As a measure of reliability, this coefficient is poor
and indicates there were serious differences among judgments of
informants regarding the same individuals being evaluated. When appraisal
ratings obtained by each of the four methods were intercorrelated, the
coefficients varied from .46 to .59, and the mean was approximately .52.
Members of the assessment staff, however, agreed much more closely when
they themselves rated the field performance of individuals on the basis
of information they had obtained from several sources about each
person. In this instance, the mean reliability coefficient was approximately
.80.

* The only one of the four types of appraisal needing clarification is the fourth. When
OSS men returned from field service, they were asked to rate other OSS personnel known
to them in their areas of operation. Members of the assessment staff, under type (3),
rated field personnel on anxiety, dejection, homesickness, irritability, quarrels,
alcoholism, psychosomatic symptoms, and strength of complaints.

Validity of the original assessment ratings was estimated by correlating 
them with the several appraisals. The obtained coefficients were low, 
ranging from .08 to less than .40, with one exception (.53). The most
satisfactory correlations were found between staff assessment ratings and
overseas staff appraisals. This fact is attributable, first, to the staff's
professional experience and greater competence in evaluating behavior; and,
second, to the fact that as members of the staff they were applying the 
same criteria in the rating of field behavior as they and their colleagues 
had applied in their original assessment ratings. 
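
A validity coefficient of this kind is a correlation between two sets of ratings of the same persons. The sketch below assumes the product-moment (Pearson) coefficient, which the text does not specify; the function name and the ratings are invented for illustration and are not OSS data.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings of five persons: assessment at the center,
# then a later appraisal in the field.
assessment = [3, 5, 2, 4, 1]
appraisal = [2, 4, 3, 5, 1]
example_validity = pearson_r(assessment, appraisal)
```

The same computation, applied to two sets of ratings of the same kind (for example, two informants rating the same individuals), gives a reliability coefficient rather than a validity coefficient.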

More significant data on the effectiveness of the assessment procedures 
are the percentages of unsatisfactory cases in the field, found among the 
men and women who had been passed as satisfactory (high or medium) by 
the assessment staff. In reports from two centers, the percentages of
unsatisfactory individuals from the groups rated high or medium
(combined), as found by each of the four types of appraisal were: 14.8 and 6.0
(overseas-staff appraisal); 13.4 and 15.2 (theater commander's comments);
11.3 and 4.5 (reassignment-area appraisal); 16.1 and 3.5 (returnee
appraisal).


The authors of the assessment study suggest that data on the predictive
. . . employed were not more impressive . . . reasons: (1) defects of the
appraisal . . . ; (2) . . . assessment methods; (3) assessment . . . filled
and conditions under which . . . coefficients.

Since the termination of . . . of the OSS situational tests, . . .
complexities inherent in the . . . were used, such as projective
tests, personality inventories, achievement and aptitude tests, and
interest questionnaires. Also, the period of
observation was longer than that of the OSS—nine days as compared with 
three (or at times only one). 

The purpose of the Michigan project was to discover, if possible, the 
traits that contribute to or detract from competence and success in the 
practice of clinical psychology. Since there were few adequate criteria 
beyond academic ratings in graduate studies, the findings were incon- 
clusive. This is not surprising, inasmuch as the observed behaviors and 
the personality traits rated in the situational tests (for example,
cooperation, group participation, expressive movements, and ability to
empathize) are not necessarily associated with academic and intellectual 
aspects of professional preparation. 


Evaluation of Situational Tests 


The several psychological techniques presented in this chapter 
are not of equal value, nor at comparable levels of development, nor 
applicable with equal facility. Sociometric methods are furthest along in 
development, can most readily be applied, and yield results most easily 
interpreted. They are valuable, in the hands of professional psychologists 
and other professional groups, for furnishing descriptions of group struc- 
tures and of individual status within the group. They do not provide 
information regarding the causes of the structure or status. These can be 
determined only through close study of the individuals and the com- 
munity involved. 

The psychodrama and the sociodrama are based upon the long- 
recognized value of psychological catharsis, but catharsis through activity 
rather than verbalization. The proponents of this technique also claim 
that it develops spontaneity of behavior, which promotes wholesome
development and adjustment. This remains to be demonstrated. The
psychodrama, in addition, is so devised as to provide the subjects with
opportunities to gain insight into their conflicts and into the attitudes of
other persons involved with them in the conflict. In this respect, the
psychological rationale appears to be sound. The final test of the validity
of any technique of personality diagnosis or therapy is a pragmatic one.
And here, although a number of case studies have been reported, final
evaluation must be deferred until adequate data are available regarding
the efficacy of the technique.

The situational tests used in the OSS and elsewhere are basically
sound, from the psychological viewpoint, in that they demand activity
of the subject in a situation that simulates the actual setting or task to be
performed. The tests were devised to yield evaluations of specified
personality traits. The creation of situational tests to bring out particular
traits, their numerical rating on a scale, and the statistical study of their 
interrelations demonstrate that situational tests and psychodrama, as 
testing techniques, need not be entirely a matter of subjective judgment. 
It is apparent, of course, that the use of the situational-test technique 
is, at best, difficult. Often elaborate facilities are required, and a staff of 
experienced psychologists able to diagnose and interpret behavior is es- 
sential. The complexity and difficulty of situational testing is evident 
from the following necessary characteristics of a system of personality 
assessment, as advocated by the authors of the OSS report (21, p. 464): 


1. Social setting: the whole program to be conducted within a social 
matrix of staff and candidates, permitting frequent informal contacts and 
opportunities to observe typical modes of response to other persons. 

2. Multiform procedures: many different techniques to be employed; 
standardized tests, uncontrolled situations, performance tests, projective 
methods, and interview. 

3. Lifelike tasks: in a lifelike environment; complicated tasks requiring 
organization of thought at a high integrative level, and some of them to be 
performed under stress and in collaboration with others. 

4. Formulations of personality: collection of sufficient data to permit con- 
ceptualization of the form of some of the chief components of the personality 
of each individual; the formulation to be used as the basis in making recom- 
mendations and predictions. 

5. Staff conference: interpretations of the behavior of each individual at 
a final meeting of staff members; ratings and recommendations to be 
reached by consensus. 

6. Tabulations of assessments: formulations of personality, ratings of 
traits, and predictions of effectiveness to be recorded for the purpose of
statistical treatment and precise comparisons with later appraisals. 

7. Valid appraisal procedures: special attention to be devoted to the 
perfection of appraisal techniques, to determine the validity of each test in 
the assessment program and of ratings of each variable. 


It is not probable that this ideal program will be achieved even after 
a long time, or in even a few centers of personality study. In the mean- 
time, approximations have been achieved in some psychological clinics, 
where all the tests can be administered and conditions met excepting the 
situational tests themselves. As a substitute for these, psychologists and 
other qualified persons have made detailed observations and rated be-
havior of subjects, not in "lifelike" situations but in actual life situa-
tions: for example, children in the classroom or on the playground, teach-
ers in the classroom, adolescents in clubs and in games, employees at
work, a man and wife during a discussion. Obviously, there will always
be some personality diagnoses and predictions that will have to depend
upon situational tests (simulated, or lifelike) simply because it is im- 
possible to place and observe the individual in the actual situation. For 
this purpose and for personality study of individuals under controlled 
conditions, the situational-test technique holds promise.5

Purposes and Types


It has been estimated that there are approximately five hundred 
personality tests and inventories. Obviously, there would be no point in 
attempting to describe all of them, or even a large number; for they all 
have much in common; and many of them, inferior in conception and 
validation, merit no attention. The inventories that are briefly presented 
in the following pages are among those that have been most widely used 
and are representative of the group as a whole. The satisfaction of these 
criteria, however, by no means implies that they are wholly satisfactory in- 
struments. They are useful within limits and in the hands of qualified 
psychologists.

Rating scales, for the most part, are intended to reveal how other per- 
sons (the judges) respond to or have been impressed by the subject; these 
scales provide evidence of the value placed upon an individual in certain
group situations. Personality inventories, on the other hand, are self-
rating questionnaires that deal not only with overt behavior (for example,
insisting on having one’s own way, emotional expression, sympathetic
acts), but also with the person’s own feelings about himself, other per- 
sons, and his environment, resulting from introspection (liking to be 
alone and living introvertly, need for praise, repression of desires, caution 
and worry). Insofar as inventories actually evaluate aspects of personality 
that are beyond impressions made upon observers, and resulting reputa-
tion, they are the more valuable instruments.

Personality inventories may be classified into five types: those that
(1) assess specified traits (for example, ascendance, conservatism, self- 
confidence); (2) evaluate adjustment to several aspects of the environment 
(home, school, community); (3) classify into clinical groups (paranoiac, 
psychopathic personality); (4) screen persons into two or three groups 
(psychosomatic disorders versus normal); (5) evaluate interests, values, 
and attitudes (vocational interests, scientific and economic values, atti- 
tude toward religion). 

An example of the first group is the Bernreuter; of the second, the 
California; of the third, the Minnesota Multiphasic; of the fourth, the 
Cornell Index; of the fifth, the Kuder or the Allport-Vernon-Lindzey 
Study of Values. 

Classification into five groups does not signify that the inventories in 
each have nothing in common with the others. The differences between 
them are dependent upon purposes, organization, nature of total content, 
and scoring categories. Fundamentally, nearly all personality inventories 
are based upon the principle that behavior and personality are, in part, 
manifestations of certain traits, and that the strength of traits can be 
evaluated by them. 

A trait may be defined as a generalized mode of behavior or a form of 
readiness to respond with a marked degree of consistency to a set of 
situations that are functionally equivalent for the respondent. It is a form 
of adaptive or expressive behavior employed by the individual in situa- 
tions that he perceives as having some equivalence. Thus, if a child 
readily volunteers information and opinions in classrooms but is reticent 
in all other situations, his school behavior would be regarded as a “habit” 
rather than as a “trait.” However, if this child's classroom self-confidence 
extends into a variety of situations—that is, if it is generalized—his self-
confidence would be designated as a “trait.” Also, if a person always votes
for the most conservative party, this might be only a habit; but if
conservatism is his characteristic mode of responding to a variety
of situations (along a scale of conservatism-radicalism), then conservatism
is one of his traits.

Thus, in personality inventories an effort is made to estimate the 
presence and strength of each specified trait through a number of items 
representing a variety of situations in which the individual's generalized 
mode of responding may be sampled. The traits selected for measurement 
in an inventory are those present in varying degrees that can be
measured among the members of the population for whom the inventory
is intended. 

Four methods have been used to develop the content (items) of personality
inventories. These methods are not mutually exclusive. They are: (1) content validation;
(2) known, or criterion, groups; (3) concept, or construct, validation; 
(4) factor-analysis techniques. 

Content validation was the first procedure used. The Woodworth 
Personal Data Sheet, the first of the personality inventories, was developed 
by this method during World War I for use in the armed forces.1 Its
items were based upon (1) behavioral problems and symptoms reported 
for psychoneurotic cases of various degrees of severity and (2) discussions 
with psychiatrists regarding behavior and experiences of persons in this 
group. The purpose of this inventory, used in lieu of time-consuming 
interviews, was to identify men who would be poor risks in training or 
combat. Several of the items follow. They are answered either yes or no.2


Do you usually feel well and strong? 

Do you usually sleep well? 

Are you bothered much by blushing? 

Did other children let you play with them? 

Do people find fault with you more than you deserve?

Do you think you have too much trouble in making up your mind? 


The items in the inventory are concerned with the whole range of symp- 
toms of psychoneuroses: psychosomatic symptoms, excessive fears, sleep 
disturbances, obsessions, compulsions, motor disturbances, paranoid feel- 
ings, sex interests, feelings of unreality, emotional history of family, and 
other areas in which excessively deviant behavior, experience, and feelings 
are significant.

The Mooney Problem Check List (1950) is a relatively recent example
of content validation. Its items, covering a wide range of interests, ac-
tivities, and concerns, were derived from case records, counseling inter-
views, and written reports of their own problems submitted by several
thousand high school students. The principal purpose of this check list
is to serve as a guide to organized introspection by the respondent and
subsequent interview by a counselor.

The use of known, or criterion, groups is a familiar method, long used 
in constructing all types of psychological tests, as the reader already 
knows. In developing a personality inventory, two or more known groups
are selected (for example, delinquents and nondelinquents, hypochon- 
driacs and nonhypochondriacs, schizophrenics and nonschizophrenics). 
The groups are given sets of questions (items) to answer; group dif-
ferences in the answers to each question are analyzed in order to find
those items on which the groups differ to a statistically significant degree.


1 This inventory was published in 1919 by C. H. Stoelting Co., Chicago.

2 These items and most of the others in the Woodworth PDS have been used in many of
the later inventories.

Examples of inventories that employed this procedure are the Minnesota
Multiphasic Personality Inventory and the Cornell Index. The validity 
of an instrument so devised will depend upon the adequacy of the cri- 
terion groups (numbers and representativeness) and upon the soundness 
of diagnoses or classifications made by psychologists or psychiatrists. 
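
The criterion-group procedure can be illustrated with a minimal sketch. It assumes yes/no items scored 1/0 and uses an ordinary chi-square test on a 2 x 2 table as the test of significance; the text does not say which statistic any particular inventory used, and the function names, the .05 level, and the data are invented for illustration.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
                 yes   no
       group A    a     b
       group B    c     d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:          # an empty row or column: no evidence of a difference
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_items(group_a, group_b, critical=3.841):
    """Return the indices of items whose yes/no split differs between groups.

    group_a, group_b: lists of rows, one row of 0/1 answers per person.
    critical: 3.841 is the .05 point of chi-square with 1 degree of freedom
              (an assumed significance level, not one taken from the text).
    """
    keep = []
    for j in range(len(group_a[0])):
        a = sum(row[j] for row in group_a)      # "yes" answers in group A
        b = len(group_a) - a
        c = sum(row[j] for row in group_b)      # "yes" answers in group B
        d = len(group_b) - c
        if chi_square_2x2(a, b, c, d) > critical:
            keep.append(j)
    return keep


# Invented answers for two criterion groups of ten persons each, three items.
clinical = [[int(i < 9), int(i < 5), int(i < 6)] for i in range(10)]
normal = [[int(i < 2), int(i < 5), int(i < 4)] for i in range(10)]
print(select_items(clinical, normal))  # [0]
```

Only the first item separates the two groups here; the other two would be dropped. With real data the result is only as good as the criterion groups themselves.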

Concept, or construct, validation was explained in Chapter 5. When 
using this procedure, the psychologist begins with one or more per- 
sonality traits he has analyzed and defined and which he wants to evaluate. 
One of the best examples of this procedure is Maslow's Security-Insecu- 
rity Inventory. Among other traits similarly evaluated by means of in- 
ventories are introversion-extroversion, dominance-submission, confi- 
dence, sociability, and neurotic tendencies. When the author of an 
inventory starts with a conception of a trait, he devises or selects items 
that he believes fall within his definition and analysis of the trait. 

When construct validation is used, responses to the items of each scale 
(or trait variable) are analyzed in one or both of two ways. The scores 
may be intercorrelated and factor analyzed in order to identify and 
select relatively homogeneous items as content of the scale. Or total scores 
on all the items may be found for all individuals; then the obtained 
responses to each item are analyzed in relationship to the total scores 
to find the extent of agreement or disagreement of each item with the 
overall trend indicated by the total scores (see Chapter 5 on factor 
analysis and item analysis). 
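
The second of these two analyses, relating each item to the total score, can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular inventory: items are scored 0/1, the index is the ordinary product-moment correlation, and each item is correlated with the total of the remaining items (a common refinement that keeps the item from being correlated with itself). The function names and data are invented.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_total_correlations(responses):
    """responses: one row per person, each row a list of 0/1 item scores.
    Returns, for each item, its correlation with the total of the other items."""
    totals = [sum(row) for row in responses]
    result = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [t - i for t, i in zip(totals, item)]   # total without item j
        result.append(pearson(item, rest))
    return result


# Invented data: six persons, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
for j, r in enumerate(item_total_correlations(responses)):
    print(f"item {j}: r = {r:+.2f}")
```

In these invented data the last item runs against the overall trend (its correlation is negative), so it would be a candidate for removal or for reversed scoring.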

If factor analysis is the method used, to begin with, the psychologist 
starts with a large pool of items. Although he does not devise and select 
items at random, he does not proceed within the restrictions set by one
who uses construct validation; his items are relatively heterogeneous
rather than homogeneous. Examples of inventories based upon this pro-
cedure are the Guilford-Zimmerman Temperament Survey and the IPAT
High School Personality Questionnaire. 

The obtained responses to each item are correlated separately with the
responses made to each of the other items; that is, all possible pairs of
items are correlated. The resulting correlation coefficients are then factor
analyzed to determine which items have enough in common (that is,
“high loadings”) to constitute a factor. The psychologist then examines
the content and the apparent characteristics involved in the items within
each factor to determine what aspects of personality they have in common.
Each factor is then given an appropriate name, and its items
constitute a scale for the evaluation of the personality trait identified
by the factor analysis.

The factor analyst, in devising personality scales as described above,
does not begin with preconceptions of particular traits he wants to
measure; nor does he necessarily know beforehand what they may turn
out to be, or for which groups or types of persons each trait might be 
significant. As in other approaches, the factor analyst draws on items 
devised by his predecessors and creates some of his own. Thus, the factors 
that emerge for him are bound to have something in common with in- 
ventories constructed by other procedures. It is obvious that the number 
of factors obtained will depend upon one's resourcefulness in the num- 
ber and variety of items used (and upon the availability of mechanical 
computers). It also appears that to begin the development of a personality 
inventory by this procedure is to work without a directing or clear ob- 
jective. Any scales thus derived must subsequently be widely applied to 
learn whether they have significance or relevance to this, that, or the 
other group. From the viewpoints of both personality theory and effective 
use of personality inventories, factor analysis should follow and be an 
aid to the use of constructs and criterion groups. 
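
The first steps of the factor-analytic procedure (correlating all possible pairs of items, then looking for items with high loadings on a common factor) can be sketched as below. This is a deliberately simplified illustration: it extracts only the first principal component of the correlation matrix by power iteration, which is a stand-in for, not an implementation of, the multiple-factor methods used by the inventories named above. The function names and the artificially clean data are invented.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)]) / (sx * sy)


def correlation_matrix(responses):
    """Correlate every item with every other item (rows are persons)."""
    cols = list(zip(*responses))
    k = len(cols)
    return [[1.0 if i == j else pearson(cols[i], cols[j]) for j in range(k)]
            for i in range(k)]


def first_component_loadings(R, iterations=500):
    """Loadings of each item on the first principal component of R,
    found by power iteration (a simplification of factor analysis)."""
    k = len(R)
    v = [1.0 / k ** 0.5] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(k)) for i in range(k)]
        eigenvalue = sum(x * x for x in w) ** 0.5
        v = [x / eigenvalue for x in w]
    if sum(v) < 0:                       # fix the arbitrary sign
        v = [-x for x in v]
    return [x * eigenvalue ** 0.5 for x in v]


# Invented data: eight persons; items 0-2 share one pattern, items 3-4 another.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0]
responses = [[a[p], a[p], a[p], b[p], b[p]] for p in range(8)]

loadings = first_component_loadings(correlation_matrix(responses))
print([round(x, 2) for x in loadings])
```

Items 0 to 2 load highly on the component and items 3 and 4 do not, so the first three would be grouped into one scale. What that scale should be called, and for whom it is significant, the arithmetic does not say.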

Regardless of the procedure employed initially, any scale, to be of 
significance, must ultimately demonstrate its usefulness and value with 
certain specified groups of individuals (criterion groups; external valid-
ity); and it must be based upon, or develop, psychological concepts (con-
structs) that will contribute to analyses and descriptions of personalities.


Representative Inventories 3


The Bell Adjustment Inventory consists of questions intended 
to evaluate the subject's status in respect to home (satisfaction or dis-
satisfaction with home life); health (extent of illness); social adjustment
(extent of shyness, submissiveness, introversion); emotional adjustment
(extent of depression, nervousness, ease of disturbance); and (adults)
occupational adjustment (satisfaction with work, associates, and condi-
tions). There are two forms: one for students (grade 9 through college)
and one for adults. The items are of the usual kind, to be answered as
yes, no, or ?. For example: "Are you troubled with shyness?" "Are you
often sorry for the things you do?" "Do you daydream frequently?"

The Bell inventory, based on content validity, raises a problem com- 
mon to all devices of this kind: Do the questions and the scores for each 
category actually represent separate and distinct aspects of behavior and
adjustment? Are these aspects mutually exclusive? Some critics maintain 
they are not. They hold, on the contrary, that the same personality vari- 
ables influence adjustment in all situations and, therefore, that the more 
useful and significant inventories are those that probe the various psycho-
logical mechanisms such as hysteria, defense (for example, rationalization


3 Data on validity and reliability of these inventories are given in Tables 23.1 and 23.2.

and projection) and escape techniques (for example, negativism, sup-
pression), and psychosomatic manifestations. Other psychologists, while
recognizing the instrument's inability to reveal the dynamics of behavior, 
nevertheless believe it is useful in placing the individual relative to a 
group in respect to the specified areas of behavior, and as a basis for 
further psychological interviewing. While the first criticism is warranted, 
the Bell inventory has found wide and justified use for the latter purpose. 

The Bernreuter Personality Inventory is a questionnaire for use in 
grades 9 to 16, and with adults. Although the items are not arranged into
categories, they are scored for six traits: neurotic tendency, self-sufficiency, 
introversion—extroversion, dominance-submission, confidence, and socia- 
bility. The last two of these were added by J. C. Flanagan after factor 
analysis. The items themselves and the manner of answering (yes, no, ?)
are not new. The method of scoring was, however, not typical; each of 
the responses to each item is regarded as characteristic of several traits, 
the scores for each item being weighted on the basis of empirically or 
statistically determined differentiating power. There are thus six scoring 
scales, one for each of the specified traits. 
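
A minimal sketch of this kind of scoring follows. The weights, item numbers, and function name are hypothetical and are not the published Bernreuter key; the sketch shows only the principle that a single response can contribute, with different weights, to several trait scales at once.

```python
from collections import defaultdict

# Hypothetical key: item -> response -> {trait: weight}.
# These numbers are invented for illustration only.
WEIGHTS = {
    0: {"yes": {"neurotic tendency": 3, "self-sufficiency": -1},
        "no": {"neurotic tendency": -2, "self-sufficiency": 1},
        "?": {}},
    1: {"yes": {"dominance-submission": 2, "neurotic tendency": -1},
        "no": {"dominance-submission": -2, "neurotic tendency": 1},
        "?": {"dominance-submission": 0}},
}


def score(answers, weights=WEIGHTS):
    """answers: dict mapping item number to 'yes', 'no', or '?'.
    Returns one total per trait by summing the weights of the chosen responses."""
    totals = defaultdict(int)
    for item, response in answers.items():
        for trait, weight in weights[item][response].items():
            totals[trait] += weight
    return dict(totals)


print(score({0: "yes", 1: "no"}))
```

Each answer sheet is in effect scored once per trait, each time with a different set of weights.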

This inventory, starting with trait concepts, has been criticized for 
establishing arbitrary categories and for being inadequate for individual 
diagnosis. The first of these criticisms is not warranted, since the cate- 
gories are widely recognized and accepted personality traits. Its principal 
value is as an aid in identifying persons at the extremes of the scale, as 
an early step in their psychological study. Several sample items follow: 
“Have you ever crossed the street to avoid meeting a person?” “Are you
inclined to study the motives of people carefully?” “Do people ever come
to you for advice?”

The California Test of Personality, based on content validity to begin 
with, is ". . . organized around the concept of life adjustment as a
balance between personal and social adjustment.” It has five scales:
primary, elementary, intermediate, secondary, and adult. The questions,
answered either yes or no, are grouped under the following categories:


Personal adjustment: self-reliance, sense of personal worth, sense of per-
sonal freedom, feeling of belonging, withdrawing tendencies, nervous
symptoms.

Social adjustment: social standards, social skills, antisocial tendencies,
family relations, school relations, occupation relations (adult level only),
community relations.


This twofold division is consistent with a frequent practice of
classifying adjustment difficulties into "personality" problems (personal
adjustment) and "conduct" problems (social adjustment). But it should
not be assumed that any particular focal maladjustment is restricted to
only one or the other of these categories; for, in fact, the whole person and 
his environment are involved in behavioral difficulties or disorders. What 
these two major categories and their subdivisions do is assist in identify- 
ing some of the principal sources of an individual's problems. This in- 
ventory, like the Bell, provides an opportunity for responses that may 
be symptomatic of maladjustment and that can be valuable in subsequent 
psychological interview and treatment.
Several items from the intermediate inventory follow.


Do you keep on working even if the job is hard? (self-reliance)

Do you find that a good many people are mean? (sense of personal worth)

Is it hard for you to say nice things to people when they have done well?
(social skills)

Do you often visit at the homes of your boy and girl friends in your
neighborhood? (community relations)


The Minnesota Personality Scale, which has separate forms for men
and for women, is intended to rate the following aspects of personality:
morale (belief in society's institutions and future possibilities); social ad-
justment (gregariousness and social maturity); family relations (parent-
child relations); emotionality (degree of stability); economic conservatism
(degree on a scale from conservatism to radicalism). The inventory is 
devised for use in the last two years of high school, with college students, 
and “in some adult cases.” An aspect of this instrument infrequently 
found is the gradation of answers whereby the subject indicates the
strength of his responses. Instead of the commonly used yes, no, or ?, the
subject in this instance has five choices, such as strongly agree, agree, un-
decided, disagree, strongly disagree; or almost always, frequently, occa-
sionally, rarely, almost never. The score of each item is weighted from one
to five, corresponding to the degree of intensity represented by the choice
of answer.
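
Graded scoring of this kind can be sketched briefly. The text states only that each item is weighted from one to five; which end of the agreement scale receives the 5, and the reversal of items worded in the opposite direction, are assumptions made here for illustration, as are the names used.

```python
# Assumed direction: "strongly agree" = 5 ... "strongly disagree" = 1.
AGREEMENT = {
    "strongly agree": 5,
    "agree": 4,
    "undecided": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def item_score(choice, reverse=False):
    """Weight one answer from 1 to 5; reverse=True flips the direction
    for an item worded against the trait (an assumed convention)."""
    weight = AGREEMENT[choice]
    return 6 - weight if reverse else weight


def scale_score(answers, reversed_items=frozenset()):
    """Sum the weights for a list of answers; reversed_items holds the
    positions of items to be scored in the opposite direction."""
    return sum(item_score(choice, reverse=(i in reversed_items))
               for i, choice in enumerate(answers))


print(scale_score(["agree", "disagree"], reversed_items={1}))  # 8
```

A subject who agrees with the first statement and disagrees with a reverse-keyed second statement receives 4 on each item, for a scale score of 8.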

Perhaps the only parts of this inventory needing special mention are
the first (morale) and economic conservatism. Both traits are
unusual in personality questionnaires. In the first, whereas high scores
are regarded as indicative of belief in society's institutions and future
possibilities, low scores “usually indicate cynicism or lack of hope in the
future.” Sample items on morale are: “No one cares much what happens
to you.” “Court decisions are almost always just.” Two items on the scale
of economic conservatism-radicalism are: “On the whole our economic
system is just and wise.” “Poverty is chiefly the result of injustice in the
distribution of wealth.”

The particular selection of the five personality aspects tested by the
Minnesota inventory may appear to be a rather strange one. The au-
thors explain their selection as being “. . . the result of . . . work on
problems of personality measurement in a clinical personnel program
in the University of Minnesota.” The personality aspects sampled with
this instrument have been found valuable in identifying ". . . a sub- 
stantial proportion of adjustment problems in a large scale student per- 
sonnel program” after a number of traits and attitudes had been ex- 
perimentally investigated. 

The Minnesota Multiphasic Personality Inventory is the most elaborate 
and ambitious instrument in this field. It is not surprising, therefore, that 
it has been subjected to more research than any other one. The authors 
state in their manual that the inventory is designed ultimately to
provide, in a single test, scores on all the more important phases of per- 
sonality. The point of view determining the importance of a trait in this 


subject into three groups— 
Pending upon whether he regards the 


manic), delusions, Phobias, masculinity-femininity 
and grouped to form 


Other psychologi 


items; among these are several for evaluating traits of persons within a
normal range of behavior. Some of the newer scales might have general
applicability whereas others are to be used only with special groups (49).
Among these derived scales are:


General maladjustment (Gm) 
Social status (St)

Prejudice (Pr)

Dominance (Do) 

Ego strength (Es) 

Control in psychological adjustment (Cn) 

Caudality: that is, clinical discrimination between brain lesions in frontal 
and parietal regions (Ca)


It is apparent from this list of scales that the Multiphasic inventory 
was originally concerned exclusively with the clinical problem of dif- 
ferential diagnosis. This is further indicated by the fact that the scales 
were developed by contrasting normal groups with clinical psychiatric 
cases. The chief criterion of validity was the prediction of clinical cases 
against the diagnoses of a hospital staff. 

The items that constitute the inventory are not unusual. Its distinguish- 
ing characteristics are its comprehensiveness; its large number of scales 
to diagnose clinical types; the large number of research publications; and 
four unusual scores, obtained in addition to the diagnostic classifications. 
These four are a “validity score” (F), a “lie score" (L), a “question score,” 
and a “K-score.” 4 

The first of these, the “validity score,” is based upon a group of items 
that serve “, . . as a check on the validity of the whole record. . . . If 
the [validity] score is high, the other scores are likely to be invalid either 
because the subject was careless or unable to comprehend the items, or 
because someone made extensive errors in entering the items on the 
record sheet. A low [validity] score is a reliable indication that the sub- 
ject's responses were rational and pertinent." This score is obtained from 
sixty-four items that have been answered uniformly by about 90 per-
cent of normal persons and by nearly as many miscellaneous abnormal
subjects. It has been concluded, therefore, that a marked deviation from
these uniform responses is an indication of invalidity of other responses,
for reasons given above.

The “lie score" consists of fifteen items on which a high score strongly 
suggests that the subject has answered them falsely in order to create a
favorable impression and thereby place himself in a socially acceptable
light. In general, these are statements about one's behavior which, if they
apply to the subject (and they do apply to practically everyone), indi-
cate that he is something less than perfect. Though a high lie score does
not necessarily invalidate the other scores, it may indicate that responses 
to items in general have been influenced by a tendency to lie, or mis- 
represent. Thus, the total findings would be open to question. There are, 


4 Although F-scores of a group affect statistical studies of validity and one's estimate
of the value and credence to be given to a particular individual's scores and responses
to the inventory, the term “validity” is not used in its usual technical sense.



however, instances of individuals whose “lie scores" are not indicative of 
conscious falsifying, but are symptomatic of personality traits that should 
be probed; for the individual himself might be unaware of the motivation 
responsible for his false answers.

The "question score" is the total number of statements placed in the 
Cannot Say category. The authors of this inventory state that a high 
"question score" invalidates the others; for it is held that such high scores 
tend to move high deviate scores toward the mean. In other words, it 
appears that persons who tend to deviate from the mean (the normal 
range) more often classify statements as doubtful, whereas a correct 
classification would produce an even greater deviant score and rating. 
Here again, the subject who has a high "question score" is not necessarily 
aware of the factors responsible for it. 
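
The three counts described so far (validity, lie, and question scores) amount to simple tallies, which can be sketched as follows. The item numbers, keys, and function name are hypothetical; the real inventory's item lists and cutoff values are not given in the text and are not reproduced here. Treating an unanswered item as neither a deviation nor a lie response is likewise an assumption.

```python
def validity_indicators(answers, modal_key, lie_key):
    """Tally three checks on a record.

    answers:   dict item -> "T", "F", or "?" (the Cannot Say category)
    modal_key: dict item -> the response nearly everyone gives (F-type items)
    lie_key:   dict item -> the response that claims an implausible virtue
    All item numbers and keys passed in are hypothetical examples.
    """
    # Validity (F) score: answers that depart from the nearly uniform response.
    f_score = sum(1 for item, modal in modal_key.items()
                  if answers.get(item, "?") not in (modal, "?"))
    # Lie (L) score: answers given in the "too good to be true" direction.
    l_score = sum(1 for item, keyed in lie_key.items()
                  if answers.get(item) == keyed)
    # Question (?) score: statements sorted into Cannot Say.
    q_score = sum(1 for response in answers.values() if response == "?")
    return {"F": f_score, "L": l_score, "?": q_score}


record = {1: "T", 2: "F", 3: "?", 4: "T", 5: "F"}
print(validity_indicators(record,
                          modal_key={1: "F", 2: "F"},
                          lie_key={4: "T", 5: "T"}))
```

The tallies are the easy part; deciding how high a count must be before the record is set aside is a matter for the manual and for clinical judgment.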

The K-score is a “correction factor" used to obtain increased validity 
of the scales. Application and interpretation of the inventory revealed 
that some normal persons got highly unfavorable scores in certain areas; 
scores that indicate abnormality. An analysis was made of the items that 
were marked unfavorably by these normal persons (called “false posi- 
tives”); these items constitute the K correction score. A low K-score is said 
to indicate that the person was excessively severe in evaluating himself: 
overcandid, very self-critical, or exaggerated minimal symptoms. A high 
K-score, by contrast, represents a desire to make a more nearly normal (or 
favorable) impression: unconscious or preconscious defensiveness against 
psychological weakness, or deliberate distortion. The authors of this in- 
ventory regard the K-score as a subtle measure of test-taking attitude; 
hence, it is a score that gives further evidence of the over-all validity of 
the scores on each of the several scales. 

The MMPI is one of the several most widely used inventories, both
clinically and experimentally. As is to be expected when dealing with the
subtle and often elusive problems of personality traits, clinical and ex-
perimental findings have not been in complete agreement; and some have
been negative. 

The authors report in their manual that a high score on a scale cor-
rectly predicted the final clinical diagnosis in more than 60 percent of
new psychiatric admissions. This is an encouraging finding, in view of
the lack of a high degree of agreement (low reliability) among psy-
chiatrists who make the clinical diagnoses and classifications. An instru-
ment's external validity, as already emphasized (Chapter 5), depends in
part upon the criterion's reliability. It may therefore be said, at least,
that the MMPI is valuable in facilitating diagnosis and in describing and
predicting behavior.
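
The dependence of validity upon the criterion's reliability can be stated compactly with the classical attenuation relation. The notation below is supplied here for illustration and is not taken from the text, and the figures in the example that follows are hypothetical.

```latex
% r_{xy}: observed correlation between inventory score x and criterion y
%         (the validity coefficient)
% r_{xx}, r_{yy}: reliability coefficients of the inventory and of the criterion
% \rho: correlation between the error-free ("true") scores on x and y
r_{xy} = \rho \,\sqrt{r_{xx}\, r_{yy}}
\qquad\Longrightarrow\qquad
|r_{xy}| \le \sqrt{r_{xx}\, r_{yy}}
```

If, for example, an inventory scale had a reliability of .90 and psychiatrists' classifications a reliability of only .60, the observed validity coefficient could not exceed about .73 (the square root of .54), however well the scale measured what it was intended to measure.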


Other published studies report significant agreement between scale
scores and hypochondriasis, paranoia, schizophrenia, and, in particular, 
depressions. Also, numerous investigations have reported that scale scores 
readily distinguish between pathological groups (undifferentiated), on 
the one hand, and normal persons, on the other. In other words, the 
Multiphasic inventory is more effective when its validity findings are 
not affected by the uncertainties and unreliabilities of psychiatric classi- 
fications.5

Since the MMPI was originally devised as a clinical instrument, it 
should be used and interpreted with great caution because of the clinical 
labels attached to its scoring categories. The authors of the inventory 
have been criticized for using this traditional psychiatric classification 
which has been found clinically unsatisfactory and has long been ques- 
tioned—and, in fact, rejected—by specialists in the psychology of ab-
normal behavior. In partial response to this criticism, code numbers
(from 0 to 9) are now used to represent an individual's profile of scores
in each of the categories. Although this device does not get away com-
pletely from psychiatric labels, it has tended to place emphasis upon
profiles, or patterns, of responses and descriptions of behavior rather
than upon the name of a category. Thus, "male juvenile delinquents," as
a group, get statistically reliable higher scores than nondelinquents on
scales 4, 7, and 9 (respectively, psychopathic deviate, psychasthenia, hypo-
mania). The "neurotic triad" consists of scores reliably higher than nor-
mal on scales 1, 2, and 3 (hypochondriasis, depression, and hysteria). A
clinical case diagnosed as suffering from "anxiety state" scores quite high
on scale 1 (hypochondriasis), and above normal, but not so high, on
scales 3 (hysteria), 7 (psychasthenia), and 8 (schizophrenia). This patient's
scores on the inventory are represented in code form thus: 1/378—.

The following two instruments are based in part upon the MMPI. The
California Psychological Inventory, for ages 13 and older, has 480 true-
false items (12 being duplicates), of which somewhat fewer than half are
included in the MMPI. Unlike the latter, however, the CPI is devised
for use with the “normal,” nonclinical population. It provides eighteen
scores to represent such aspects as dominance, socialization, tolerance,
achievement via conformance, achievement via independence, intellec-
tual efficiency. Its items were selected upon the basis of the familiar ex-
ternal criteria, such as social-class membership, course grades, and leader-
ship, and upon extreme groups with regard to each trait.

5 Psychologists who have reviewed and evaluated published studies on the MMPI
do not agree on its merits; some accentuate the positive side, others the negative. One
favorable review reports that 71 of 80 studies made significant group discriminations (9).
An adversely critical review reports that in 160 studies, significant discriminations were
made in 102, or 64 percent (8, p. 166).

The Minnesota Counseling Inventory, for use at the high school level,
consists of items, many of which are similar to those in the MMPI.
It provides scores on scales that are similar to the Bell Adjustment
Inventory and the California Test of Personality: family relationships,
social relationships, emotional stability, conformity, adjustment to reality,
mood, and leadership. This device, like many others, was validated against
known groups (extreme cases), who were compared with random samples
of pupils.6
Two inventories based upon factor analysis are the Guilford-Zimmer-
man Temperament Survey and the IPAT High School Personality Ques-
tionnaire (by R. B. Cattell et al.). The first of these provides scores for
each of the following traits: general activity, restraint, ascendance, so-
ciability, emotional stability, objectivity, friendliness, thoughtfulness, per-
sonal relations, and masculinity. The inventory is intended for use with
individuals in grades 9 through 16, and with adults. The particular traits
included are the products of factorial analyses made over a period of 
years by Guilford and his associates. 

The statements of these authors regarding the values and character-
istics of their inventory are much more restrained and psychologically
cautious than those of some other authors of inventories.


The authors of the IPAT Questionnaire state that it “. . . covers all
major dimensions in any comprehensive view and description of individual
differences in personality” (Handbook, page 1). Fourteen “dimensions”
of personality are measured, some of which are designated by
terms already familiar in personality testing while others have been given
rather novel names. They are: schizothymia versus cyclothymia (aloof vs.
sociable), mental defect versus general intelligence, general neuroticism
versus ego strength (emotional immaturity vs. maturity), phlegmatic
versus excitable temperament, submissiveness versus dominance, desurgency
versus surgency (sober vs. enthusiastic), lack of rigid internal standards
versus superego strength, threctia versus parmia (shy and sensitive vs.
adventurous and insensitive), harria versus premsia (tough and realistic
vs. esthetically sensitive), dynamic simplicity versus neurasthenic
self-critical tendency (liking group action vs. fastidiously individualistic),


* À questionnaire 
Scale, originated in 
confident adequacy versus guilt proneness, group dependency versus
self-sufficiency, poor self-sentiment formation vs. high strength of
self-sentiment (uncontrolled and lax vs. controlled and strong will power), low
ergic tension versus high ergic tension (relaxed composure vs. tense and
excitable).

The authors of this ambitious inventory, in summarizing its merits, 
state that one of its "utilities" is: ". . . omission of no research-demon- 
strated dimension of personality of importance in clinical, educational, 
or counseling practice" (Handbook, page 4). The statistical and other 
evidence on this inventory, however, does not justify the assertions in the
two preceding quotations. Evidence of convincing functional, external 
(practical) validity is lacking. Validation data are largely in terms of fac- 
torial intercorrelations, which might or might not be significant in coun- 
seling or in clinical problems. Additional validating information is 
provided by “criterion profiles" of different groups. These profiles do not 
provide sufficient differentiation upon which clinical diagnoses could 
be made or counseling conducted, especially in view of the low reli- 
abilities of the scales. (See Table 23.2.) 

The Cornell Index was devised “. . . for the rapid psychiatric and 
psychosomatic evaluation of large numbers of persons in a variety of 
situations." The index “. . . was assembled as a series of questions re- 
ferring to neuropsychiatric and psychosomatic symptoms, which would 
serve as a standardized psychiatric history and a guide to the interview, 
and which, in addition, would statistically differentiate persons with 
serious personal and psychosomatic disturbances from the rest of the 
population. It was devised as an adjunct to the interview, not as a
substitute unless an interview is impractical.” This questionnaire,
standardized for males only, consists of 101 items. The questions fall into two
groups: those differentiating sharply between persons with serious
personality disturbances (for example, “Does worrying continually get you
down?”) and those concerned with significant bodily symptoms (for
example, “Do you usually have trouble in digesting food?”). The questions
are undisguised and often extreme (“Are you keyed up and jittery every
moment?” “Are you a sleepwalker?”). They must be answered either
yes or no.

The authors of the Index report that it has been effective in showing 
the presence of anxiety states, hypochondriasis, asocial trends, convulsive 
disorders, migraine, asthma, peptic ulcers, and borderline clinical syn- 
dromes. It is to be noted that this inventory, unlike the Bernreuter, the 
Minnesota, and others, does not provide separate scoring scales and norms 
for specific personality traits or disorders. Its scores for the entire inven- 
tory are intended only to assist in distinguishing between those having 
serious personality or psychosomatic difficulties and those without
them. The scoring of the inventory is to be followed by an interview or
interviews, after which the diagnosis may be made.

The 101 questions themselves have been classified under the following
ten categories, the number of items varying from one to [...]: defects
in adjustment expressed as feelings of fear and inadequacy; pathological
mood reactions, especially depression; nervousness and anxiety;
neurocirculatory psychosomatic symptoms; pathological startle reactions; other
psychosomatic symptoms; hypochondriasis and asthenia; gastrointestinal
psychosomatic symptoms; excessive sensitivity and suspiciousness;
troublesome psychopathy. This is distinctly an instrument for clinical use, chiefly
for screening purposes and expediting diagnosis.
Two unusual features characterize the scoring of this inventory: cut-off
scores and “stop” scores, both based upon a total of 1000 cases at military
installations. Of these, 400 were men who had been rejected for
neuropsychiatric reasons, and 600 were accepted after psychiatric interview.
A table of cut-off scores shows (1) the percentage of rejectees, at each score
level, who would have been identified by the Index, and (2) the
percentage of those accepted after psychiatric interview, but who would have
been rejected by the Index (Table 23.1). Thus, a cut-off score of 13 would
have identified 74 percent of those rejected after psychiatric interview,
and would have rejected 13 percent of the men passed after interview, as
well.
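
The percentages in a cut-off table of this kind can be turned back into head counts, which makes the cost of a given cut-off easier to see. The following is a minimal sketch, not part of the Index manual: the group sizes (400 rejects, 600 accepts) come from the text, the function name and the returned fields are illustrative assumptions, and the two calls use the 74 and 13 percent quoted in the example above and the 86 and 28 percent shown for a cut-off of 7 in Table 23.1.

```python
# Group sizes reported in the text for the Cornell Index standardization.
N_REJECTS = 400  # men rejected for neuropsychiatric reasons
N_ACCEPTS = 600  # men accepted after psychiatric interview


def screening_counts(pct_rejects_flagged, pct_accepts_flagged):
    """Convert the two percentages of one table row into head counts.

    pct_rejects_flagged: percent of psychiatric rejects the Index would flag.
    pct_accepts_flagged: percent of psychiatric accepts the Index would flag.
    (Hypothetical helper; the names are not taken from the manual.)
    """
    hits = round(N_REJECTS * pct_rejects_flagged / 100)
    false_alarms = round(N_ACCEPTS * pct_accepts_flagged / 100)
    flagged = hits + false_alarms
    return {
        "hits": hits,                      # rejects correctly flagged
        "misses": N_REJECTS - hits,        # rejects the Index would pass
        "false_alarms": false_alarms,      # accepts the Index would flag
        "flagged": flagged,
        # Share of all flagged men whom the interviewers also rejected.
        "share_of_flagged_confirmed": hits / flagged if flagged else 0.0,
    }


example = screening_counts(74, 13)   # percentages quoted in the text
row_7 = screening_counts(86, 28)     # cut-off of 7 in Table 23.1
print(example)
print(row_7)
```

Read this way, the lower cut-off catches more of the men rejected at interview (344 rather than 296) but at the price of flagging 168 rather than 78 of the men the interviewers passed.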


The “stop” questions (“Were you ever a patient in a mental hospital?”)
are such as would indicate extreme maladjustment or pathology.
“Stop” items are to be used for ready identification of men who,
presumably, are to be considered immediately for rejection.

The efficiency of this Index in identifying poor personality risks, as
shown in Table 23.1, is great enough to warrant its use for the purposes
stated by its authors, especially in situations where large numbers of
persons must be rapidly screened. In situations where such pressure does
not exist, the Index is still useful as a basis for and a guide to subsequent
interview and to psychotherapy. The fact that the Index does not identify
larger percentages of probable poor risks at some levels, while, at the
same time, it rejects a number of “psychiatric accepts,” may be
attributable to some inadequacy of the inventory or errors in psychiatric
judgments leading to acceptance or rejection after psychiatric interview.
It is widely recognized that the brief psychiatric screening interviews during
World War II were not optimally conducted; and the psychiatric
interviewers, in many instances, were inadequately prepared for their task.
TABLE 23.1

PERCENT OF PSYCHIATRIC ACCEPTS AND PSYCHIATRIC REJECTS*
IDENTIFIED AS REJECTS AT VARIOUS CUT-OFF LEVELS
OF THE CORNELL INDEX

                   400 psychiatric    600 psychiatric
Cut-off level          rejects            accepts

      0                 100%               100%
      1                  99                 82
      2                  97                 67
      3                  94                 54
      4                  93                 46
      5                  92                 39
      6                  90                 32
      7                  86                 28
      8                  85                 24
      9                  83                 20
     10                  81                 18
     11                  78                 16
     12                  76                 15
     13                  74                 13
     14                  72                [...]
     15                  68                 10
     16                  66                [...]
     17                  62                [...]
     18                  61                  7
     19                  60                  7
     20                  57                  6
     21                  55                  5
     22                  53                [...]
     23                  50                [...]
     24                  48                [...]
     25                  45                [...]
     26                  42                  3
     27                  41                [...]
     28                  40                  1
     29                  39                  1
     30                  35                  1
     31                  34                  1
     32                  32                [...]

* In terms of opinion at psychiatric interview at five induction stations.
The table reads: If a cut-off score of 7 on the Index were used, 86% of
those rejected at interview would have been rejected by Index; 28% of
those accepted at interview would also have been rejected by Index.
SOURCE: Manual of Cornell Index. The Psychological Corporation. By
permission.

The Security-Insecurity Inventory is an instrument that differs from
most others in that it is devised to assess degrees of only one pair of
opposed personality traits. The authors of this inventory have selected these
particular traits because they believe that “security is almost synonymous
with mental health.” Security is defined, essentially, as feelings of being
liked, loved, and accepted; of belonging and having a place in the group;
of safety and of being unanxious.

The S-I Inventory is intended for use with groups for research and
survey purposes, and for screening college students who might need
psychological therapy or counseling. This is a very good illustration of an
inventory designed to evaluate a personality trait that has been defined
explicitly and in detail (construct validity).


Evaluation of Personality Inventories 


Reliability and Validity. The reliabilities of these inventories,
as reported in their manuals, vary considerably from low,
unsatisfactory coefficients to some (in the .80s) that are reasonably
satisfactory, considering the traits being measured. The methods used are the
usual ones with which the student is familiar from discussions of other
types of tests, and which were explained in Chapter 4. It is especially
important to consider reliability coefficients when one judges the value and
soundness of profiles or patterns of scores obtained on an inventory.

TABLE 23.2

RELIABILITY DATA OF CERTAIN INVENTORIES

Inventory                               Reliability   Method

Bell Adjustment Inventory               .75-.97       retest
                                        .80-.89       odd-even
Bernreuter Personality Inventory        .78-.92       split-half
California Psychological Inventory      .70s          Kuder-Richardson formula
                                        .38-.87       test-retest; separate scales
California Test of Personality          .51-.97       for part scores
                                        .80-.96       for total scores
Cornell Index                           .95           Kuder-Richardson formula
Guilford-Zimmerman Temperament Survey   .75-.85       split-half
IPAT Personality Questionnaire          .68-.80       test-retest
                                        .36-.60       split-half
                                        .40-.69       equivalent forms
Minnesota Counseling Inventory          .70-.80       split-half
Minnesota Personality Scale             .90           odd-even
MMPI                                    .56-.90       retest; normal subjects
                                        .52-.89       retest; psychiatric patients
Security-Insecurity Inventory           .84           retest
                                        .86           odd-even

It is in respect to validity, however, that personality inventories as a
class present the greatest difficulties and are most vulnerable to criticism.
Determination of validity is certainly difficult; yet that must be the most
essential requisite of a useful instrument.

In devising their personality measures, the earlier authors began with
questions or statements, gathered from a variety of publications and
sources (clinics, schools, colleges, industry and business, home,
community), that are symptomatic of neurotic disorders, behavior difficulties, or
[...] them, and adding some new ones, [...]

TABLE 23.3

VALIDITY CRITERIA OF CERTAIN INVENTORIES

Inventory                               Validity criteria

Bell Adjustment Inventory               Other inventories; ratings of judges; item
                                        analysis; differentiation of extreme groups
Bernreuter Personality Inventory        Other inventories; differentiation of extreme
                                        groups; low intercorrelations of part-scores
California Psychological Inventory      Extreme groups; cross validation; other
                                        inventories
California Test of Personality          Later clinical findings; experts' judgments
                                        of item appropriateness; item analysis
Cornell Index                           Neuropsychiatric cases, after interview;
                                        normal persons, accepted after interview;
                                        civilian groups; other inventories;
                                        distribution of scores in a college population
Guilford-Zimmerman Temperament Survey   Low correlations between traits; factor
                                        analysis
IPAT Personality Questionnaire          Factor analysis; criterion profiles
Minnesota Counseling Inventory          Normal group sample; known extreme groups
Minnesota Personality Scale             Item analysis; extreme groups; known groups
                                        of adjusted and maladjusted students
MMPI                                    Diagnosed psychiatric groups; normal
                                        subjects; amount of score overlap between
                                        nosological groups; differentiation between
                                        unselected patients and normal persons
Security-Insecurity Inventory           Other inventories; self-estimate of subjects;
                                        known groups; systematic analysis of
                                        syndromes, security and insecurity

METHODS. The following methods and criteria have commonly been
used in studying validity: (1) statistically significant differences between
average scores of clinically well-defined groups; (2) significant
average-score differences between clinical groups and a normal population; (3)
ability of items to differentiate between the two extreme groups in
a distribution; (4) internal consistency of items or parts;
(5) ratings by [...] school officials; (6) selection of items from
inventories already in use and correlations with these instruments;
(7) factorial analysis; (8) the author's own judgment regarding
manifestations which constitute evidence of a specified trait.
as, for example, in the case of the Minnesota Multiphasic Personality
Inventory (depression, hysteria, paranoia, etc.). Measures standardized on this
criterion cannot and should not be used for the study of a normal popula- 
tion, except for the purpose of screening out, for further clinical study, 
individuals at the extremes of maladjustment. 

2. The second criterion is likewise used with measures that are intended 
chiefly for clinical purposes, but in this instance the emphasis is upon segre- 
gation of the normal from the abnormal, rather than differentiation among 
the abnormal themselves. 

3. The third criterion evaluates each item in regard to its effectiveness in 
distinguishing between the extreme groups of a distribution of scores for a 
single trait (for example, radical-reactionary or ascendance-submission), as 
shown by the percentage at each extreme answering the item in a specified 
manner. A test so validated should not be used for a general, representative 
population because it is not necessarily adequate to differentiate among the 
great percentage of persons who are located between the extremes. 

4. The fourth, internal consistency, differs from the preceding criterion in 
that each item is correlated against part scores for all subjects, the purpose 
being to learn whether answers to the individual items are, on the whole, 
reasonably consistent with the behavior or personality trends suggested by 
the scores. This is a form of content validity; for basic to it is the assumption 
that the total or part score actually does measure what it purports to, and 
that it is the author's task to eliminate those particular items that do not 
conform to his selected traits and to the test items as a whole. With few
exceptions, it is doubtful that internal consistency may be regarded as a
measure of validity unless external criteria are used in addition. If this is
done, then internal consistency will be sought in an effort to obtain total
scores that yield the highest validity coefficients against external criteria.

5. The fifth criterion, employed in constructing inventories to be used
principally in schools, assumes that the obtained ratings have a high
validity and that the judges are competent to assess personality traits as well
as desirable and undesirable forms of adjustment. In some instances the
assumption is warranted; in many it is not.

6. The sixth criterion assumes that items and tests already in use are
themselves valid. This practice is frequently not justified and tends to
perpetuate inadequacies, errors, and misconceptions inherent in the older
inventories.

7. Factor analysis, the seventh criterion, in the work of some investigators
has taken the place of validation against behavioral and psychological
analysis. As already explained, these investigators devise a number of
items, administer the inventory to a standardization population, factorially
analyze the scores, group the items into a number of categories, and assign to
the categories names of traits that appear to be measured by means of the
[...] should go into the inventory in the first place. This is
[...] reasoning. Actual behaviors of defined groups of persons
must be the ultimate criteria of validity of practically all personality
inventories. For personality traits derive their ultimate [...] role
they play in advancing or retarding personal and social [...]

8. In using the eighth criterion, an author selects or devises items on the
basis of his own definition of a trait or a theory of personality, without
concern thereafter over their behavioral or statistical validity. Starting with
theories and definitions is, of course, desirable; but the validation process
must go beyond that stage.
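
The third criterion in the list above (comparing how the two extreme groups answer a single item) can be illustrated with a short computation. This is a sketch under stated assumptions, not a procedure taken from any of the manuals discussed: the function name, the yes/no coding (1 for the keyed answer, 0 otherwise), and the default of taking 27 percent of the subjects at each extreme are all illustrative choices.

```python
def discrimination_index(total_scores, item_answers, fraction=0.27):
    """Difference between the proportions of the highest- and lowest-scoring
    groups who give the keyed answer to one item.

    total_scores: each subject's total score on the trait scale.
    item_answers: 1 if the subject gave the keyed answer to the item, else 0.
    fraction: share of subjects placed in each extreme group (an assumed,
              conventional value; the text does not specify one).
    """
    n = len(total_scores)
    k = max(1, int(n * fraction))  # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]
    high_group = order[-k:]
    p_high = sum(item_answers[i] for i in high_group) / k
    p_low = sum(item_answers[i] for i in low_group) / k
    return p_high - p_low


# Hypothetical data for ten subjects.
scores = [2, 5, 9, 12, 15, 18, 21, 25, 28, 30]
sharp_item = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # answered differently at the extremes
flat_item = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]   # answered alike by everyone

print(discrimination_index(scores, sharp_item))
print(discrimination_index(scores, flat_item))
```

An index near 1 means the item separates the extremes; an index near 0 means it does not. As the text cautions, neither value says anything about how well the item differentiates among the many persons who fall between the extremes.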


RESEARCH FINDINGS. Only few personality tests have been validated
according to all, or even several, of these criteria. Most have been
subjected to validation by the method of internal consistency, or correlation
with earlier tests, plus, in some instances, the use of known groups, in one
form or another. Validation data obtained by correlations of internal
consistency have yielded the most impressive results. But this is readily
understandable, because items can be retained, modified, or eliminated so
[...] yielded contradictory results (15, 16, 17). For [...]
personality questionnaires with groups of behavior-problem children, the
number of correlation coefficients of various sizes were as follows:

two above .70
one between .40 and .70
six below .40

[...] Correlating scores of normals and abnormals (diagnosed neurotics
and psychotics) [...] coefficients were obtained:

thirty-six above .70
nine between .40 and .70
thirty below .40


ysis is the reduction of the number of concepts in order 
ment, it does not 


anize a seem probable that the multiplication 
of entities will facilitate Personality testing. 
When inventory scores were validated against ratings by teachers, friends,
or associates, the findings were: 


twelve above .70 
ten between .40 and .70 
twenty-two below .40 


Validation studies of four group inventories (Bell, Bernreuter, Thurstone
Personality Schedule, Woodworth Personal Data Sheet) yielded the following
results:


twenty-five above .70
eleven between .40 and .70
forty-four below .40


More consistent and convincing results were obtained when personality 
tests—principally the Minnesota Multiphasic—were administered indi- 
vidually rather than to groups. The validity coefficients were: 


ten above .70 
three between .40 and .70 
two below .40 


These last data suggest that individual testing of personality is superior 
because subjects may be more highly motivated, owing to clinical rap- 
port; the inventories are more carefully developed; their uses are more 
limited and more clearly defined. 

Somewhat more than half of the coefficients reported above (.40 and 
higher) are either quite high or moderate as validating data. And some- 
what fewer than half are quite low (below .40). Although coefficients be- 
low .40 or .50 do not have high predictive value for all individuals within 
the group, they may, nevertheless, indicate that the inventory has value 
in identifying individuals who constitute the more deviant groups. 
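
The proportions described in the preceding paragraph can be checked by adding up the five tallies of validity coefficients reported above. The sketch below is only an arithmetic check: the counts are those given in the text, while the group labels and the function name are shorthand introduced here.

```python
# Counts of validity coefficients as (above .70, between .40 and .70, below .40).
# Labels are shorthand for the five groups of studies described in the text.
TALLIES = {
    "behavior-problem children": (2, 1, 6),
    "normals vs. abnormals": (36, 9, 30),
    "ratings by teachers, friends, associates": (12, 10, 22),
    "four group inventories": (25, 11, 44),
    "individually administered tests": (10, 3, 2),
}


def summarize(tallies):
    """Pool the tallies and report how many coefficients reach .40."""
    high = sum(t[0] for t in tallies.values())
    moderate = sum(t[1] for t in tallies.values())
    low = sum(t[2] for t in tallies.values())
    total = high + moderate + low
    return {
        "total": total,
        "at_or_above_40": high + moderate,
        "below_40": low,
        "share_at_or_above_40": (high + moderate) / total,
    }


summary = summarize(TALLIES)
print(summary)
```

Pooled in this way, 119 of the 223 coefficients (about 53 percent) are .40 or higher and 104 (about 47 percent) fall below .40, which agrees with the statement that somewhat more than half are moderate or high.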

The differences found among the large number of studies summarized 
cannot be attributed to the inventories alone. Other factors to be con- 
sidered are the number of subjects, their homogeneity, and their classi- 
fication; the soundness of the ratings or of the clinical diagnoses that 
are used as validity criteria; and the purposes for which, and the con- 
ditions under which, the inventories were administered. 

These findings indicate that inventories for the assessment of per- 
sonality traits should not be used indiscriminately or uncritically; nor 
should the sounder among them be rejected uncritically. Personality in- 
ventories are more valuable for certain defined populations than for 
others; they are more valuable in some kinds of situations than in others.
One comprehensive survey of published studies between 1946 and 1951 
reports “. . . that in most cases inventory scores discriminate [...]
when used with psychoneurotic, psychosomatic, alcoholic, age, [...]
and college groups. . . . and they usually do not give significant
discriminations when used with vocational, academic, socioeconomic, and
disabled and ill groups” (16).

ADVERSE INFLUENCES UPON VALIDITY FINDINGS. A number of reasons
have been offered in explanation of the equivocal and less satisfactory
validity data reported in the many studies published on the subject. The
principal reasons are briefly noted here.


The questionnaires sample segments of the person; they do not bring
out the whole patterned, or organismic, representation of behavior.
Personality cannot be described in terms of separate traits or a mere sum
of traits. None of the available questionnaires, rating scales, and
personal-history records are able to portray the personality as a complete, dynamic,
organized whole. They measure, not very precisely, certain aspects of
behavior. They do not actually measure or assess the unit (the organism) that
does the behaving.

Many studies of validity have been devoted to correlations with differ- 
ential diagnoses as criteria; but the psychiatric descriptions and classifica- 
tions are not always clearly defined or sufficiently distinct; psychiatric 
diagnoses are often not sufficiently reliable; and many clinical subjects are 
too unstable or unresponsive in the test situation. 

In testing for traits common to a population, attention may be diverted 
from the individual as a unit to the assignment of a mere rank or index to 
a segment.


Some inventories purporting to measure two or more separate traits are 
measuring largely the same trait under more than one name. 


Differences in cultural factors will cause subjects to respond differently 
to the same question. 


A given question or statement does not have the same meaning for all 
subjects, even when clearly stated. It is a fallacy to assume that all persons 
have similar reasons for giving similar responses to an item. 


Misunderstandings of questions are due to vocabulary limitations of some 
respondents. 


Many questions cannot be answered in the yes, no, ? form. 


There is a general tendency for some subjects to overrate themselves 
("self-halo"). 


Almost anyone can falsify his replies to a questionnaire; and an inde- 
terminate number do so. 


Some subjects lack insight into their traits; others fundamentally and 
unconsciously may be different personalities from their own conscious self- 
appraisals. 


The scoring of answers to items is often based upon the test author's own 
judgments and set of values. 


On some questionnaires, either very low or very high scores, or both, may 
be significant; but the wide middle range of scores may not be meaningful
for differentiation and description. 

Statistical assumptions and procedures often take the place of behavior 
analysis and psychological insights. 


POSITIVE CONTRIBUTIONS. On the positive side, interest and research
in the development of personality inventories have made the following 
contributions. 


Personality testing is still in process of development. 

Efforts to develop measures of personality traits encourage greater uni- 
formity in, and precision of, trait definition and description. 

When there is essential agreement in regard to definition of traits and 
terms, and in regard to behavior and symptoms, the use of standardized 
inventories increases the objectivity of personality ratings and descriptions. 

The use of personality measures encourages analysis of traits into their 
constituent elements, thus providing a better understanding of each trait. 
(The elements themselves, taken separately and in isolation, are not, how- 
ever, the trait.) 

In some cases when, consciously or unconsciously, persons misrepresent 
themselves by their answers on an inventory, the instrument may still be 
clinically valuable, because the fact that they have misrepresented is signifi- 
cant in understanding their personalities, by means of subsequent interviews. 

Psychometric analysis is useful as one of several clinical procedures, when 
its results are considered in conjunction with other evidence (for example, 
the individual's history and psychological interview). 

Answers to items of a questionnaire may be employed as the starting 
points of subsequent psychological interviews, since answers to various
questions and responses to various statements may be significant in themselves,
or they may reveal significant patterns of behavior, attitudes, and feelings.
In such instances, the numerical scores and percentile ranks can be
disregarded. Pragmatically, at the present time, a useful test of personality is
one whose score or responses to individual items assist in identifying areas
of actual or potential maladjustment for purposes of further, more intensive
study and subsequent treatment. Conversely, they can help in the
identification of areas of wholesome adjustment. At their present stage of
development, this is one of the most useful ways in which results of personality
inventories can be employed.

Personality inventories are useful in the study of group trends; that is, in
differentiating among groups of adjusted and maladjusted, rather than
among individuals.

Concluding Statement. Since personality traits and attitudes
may undergo change, it is to be expected that inventories will be less
reliable in terms of test-retest scores than, for example, scales measuring
intelligence. Yet, the use of inventories as a means of evaluating and
studying personality is justifiable, but only by professional persons who
know the principles of their construction and their limitations, and who
are capable of making insightful analyses of behavior. The instruments
are not suitable for widespread or uncritical use with large groups.

More basic research is needed. In addition to showing improved 
reliability and validity, the traits being evaluated should be more clearly 
defined, and the relevance of each item in a scale should be established. 
The meanings of items should be as nearly uniform as possible for all 
persons; thus, words such as “usually,” “rarely,” and “generally” should
be made explicit. Even words like “headaches,” “leadership,” “square 
deal” do not have the same connotations for everyone. Improved, more 
finely graded means of answering would be desirable, in place of yes, 
no, ?. 

The criteria against which personality tests are to be validated should 
also be made more reliable than they are at present. If, for example, 
clinical diagnoses are used as a criterion, they should be valid. Too often 
this is not the case. In some instances “tentative diagnoses” instead of
final ones have been employed, or, as frequently happens, diagnosticians
do not agree among themselves (3). Again, as another illustration, it is
unsound to use a blanket classification such as “delinquency” or “problem
behavior” as a criterion, because there are various kinds of delinquents
and problem behaviors, differently motivated and occurring under
varying conditions.



In this chapter, the student has become acquainted with the range of 
traits being tested, the general purposes of personality inventories, their 
similarities and differences, and the types of items being used. But in 
view of inadequacies of available questionnaires, psychologists have been 
giving increasing attention to the study of personality by means of pro- 
jective methods. These are presented in Chapters 25 and 26. 
INTERESTS, ATTITUDES, AND
VALUES


Interest Inventories 


An individual's aptitudes and abilities ordinarily are not so 
highly specific that he can be given guidance solely on the basis of apti- 
tude tests. Motivation, influenced by one's interests, values, and prefer- 
ences—in addition to aptitudes and abilities—can determine the selec- 
tion of a course of study or an occupation. Often, however, decisions are 
made through chance influences rather than through self-evaluation and 
information about a field of study or of an occupation. Several devices, 
therefore, have been developed to assist in the process of self-evaluation 
and counseling. The results obtained with them are to be used in con- 
junction with data from other sources; for while interests are positively 
correlated with aptitude and ability, the covariation is far from perfect. 

The student will observe that measures of interests, values, and atti- 
tudes do not deal with mutually exclusive, or independent, traits. These 
aspects of one's personality influence one another. 

THE KUDER INVENTORIES. These, designed for use from grade 9 onward
and with adults, are in the forms of three Preference Records:
Vocational, Occupational, and Personal. The first of these provides scores 
(and percentile ranks) in ten vocational areas: outdoor, mechanical, com- 
putational, scientific, persuasive, artistic, literary, musical, social service, 
and clerical. The second may be scored for each of thirty-eight specific
occupations, such as farmer, newspaper editor, physician, minister,
mechanical engineer, counseling psychologist, architect, retail clothier. The
third is a personality inventory, intended to evaluate [...]
characteristics of behavior regarded as significant for certain [...]
groups of vocations and as having differential value [...]. This
record is scored for the following characteristics: preference for (1) being
active in groups (for example, insurance salesmen, clergymen; [...]
engineers); (2) familiar and stable situations (farmers, toolmakers, [...]
school teachers); (3) working with ideas (professors, authors, [...]
presidents); (4) avoiding conflict (physicians, accountants, [...]);
(5) directing others (lawyers, business executives, policemen). High scores
in each category suggest preference for the activity. It is apparent that
these are not personality traits as generally understood, such as those
measured by means of the inventories described in Chapter 23. The
behaviors evaluated by the Personal Record are evidences of or
consequences of personality traits that are more nearly basic aspects or
determinants of behavior.
The items in the three preference records are of the forced-choice 
variety (see Chapter 21). Each item consists of three statements from
which the subject selects the one he likes most and the one he likes least;
for example:


Collect autographs 
Collect coins 
Collect butterflies 


Exercise in a gymnasium 
Go fishing 
Play baseball 


The items, covering a wide range of activities, are scored to yield a profile
for the ten vocational areas. A person's score on each of the
ten areas is converted into a percentile rank and the resulting profile is
analyzed with a view to determining in which areas, if any, the
individual's interests and preferences are strong.
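
The conversion of raw area scores into a percentile-rank profile can be sketched as follows. This is an illustration only: the published Preference Records supply their own norm tables, which are not reproduced here, so the norm-group scores below are invented, and the mid-rank definition of a percentile rank (scores below, plus half of those tied) is an assumption made for the example.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting half of ties.

    (Assumed definition; the inventory's own norm tables may differ.)
    """
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    tied = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * tied) / len(ordered)


def profile(raw_scores, norms):
    """Convert one person's raw score in each area to a rounded percentile rank."""
    return {
        area: round(percentile_rank(raw_scores[area], norms[area]))
        for area in raw_scores
    }


# Hypothetical norm-group scores for two of the ten vocational areas.
norms = {
    "mechanical": [10, 20, 30, 40, 50],
    "clerical": [10, 20, 30, 40, 50],
}
person = {"mechanical": 50, "clerical": 20}

person_profile = profile(person, norms)
print(person_profile)
```

In this invented case the profile would be read as showing a relatively strong mechanical preference and a relatively weak clerical one, which is the kind of comparison across areas that the text describes.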

Identification of vocati 


The items, covering a wide ran 


ntific-social service, persuasive- 
d upon actual data; many, how- 
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consistency and low correlations with the other sets. The items in the 
occupational Preference Record, however, were selected on a different 
basis; that is, by determining which items discriminate between each 


TABLE 24.1 


PERCENTILE RANKS OF MEAN SCORES IN VARIOUS OCCUPATIONS 
(KUDER PREFERENCE RECORD-VOCATIONAL) 


Occupation                          Outdoor  Mech.  Com.  Sci.  Pers.
Men
  Accountants                          42     27     94    48    53
  Civil engineers                      65     57     71    66    18
  Lawyers and judges                   41     19     48    36    57
  District forest rangers              84     50     56    54    32
  Office managers and chief clerks     37     29     77    39    57
  Sales managers                       32     30     32    38    87
Women
  Librarians                           41     46     30    35    44
  Physicians                           84     77     47    86    22
  Social and welfare workers           51     43     31    43    54
  Occupational therapists              70     88     32    58    26
  Trained nurses                       51     57     43    69    33
  Secretaries                          51     43     49    41    64
  Sales clerks

Occupation                            Art.   Lit.   Mus.  Soc.  Cler.
Men
  Accountants                          34     63     56    36    88
  Civil engineers                      67     55     42    25    55
  Lawyers and judges                   43     82     61    48    61
  District forest rangers              55     54     36    53    29
  Office managers and chief clerks     45     65     55    43    66
  Sales managers                       41     57     59    47    43
Women
  Librarians                           66     83     52    29    38
  Physicians                           60     58     45    63     1
  Social and welfare workers           59     70     46    81    25
  Occupational therapists              81     45     61    58    14
  Trained nurses                       58     46     54    75    26
                                       50     64     61    39    58
  Secretaries
  Sales clerks


Reproduced by permission of Science Research Associates. 
of the chosen occupational groups and a norm group (consisting of 1000
men selected from among telephone subscribers in 138 communities).
This procedure was followed by cross validations.
Figure 24.1 shows a profile chart with norms for boys. The numerical
column headings indicate the ten vocational areas, while the letter headings
indicate the five categories on the Personal Preference Record. Table
24.1 shows the percentile-rank equivalents of mean scores for a variety of
occupational groups, on the vocational record.
THE STRONG INVENTORIES. The Strong Vocational Interest Blank is
available in separate forms for men and women, from age 17 onward. 
Each inventory contains 400 items dealing with likes and dislikes in 
occupations, school subjects, amusements, activities, and personality 
traits; with order of preference of activities, importance of factors affect- 
ing one's work, order of preference of men (or women) one would like 
most and least to have been, positions one would like most and least to
hold in an organization, comparison of interests between paired items,
and self-rating of present abilities and traits.
FIG. 24.1. Profile chart. From the Kuder Profile Leaflet, 1953. By permission of
Science Research Associates.


The purpose of the inventory is to find the extent to which an individual's
interests and preferences agree with those of successful persons
in specified occupations: forty-seven for men, twenty-eight for women.
The inventory, scored separately for each occupation, yields ratings considered
indicative of the individual's interest in each occupation. It is
possible, also, to score the inventory for six occupational "groupings" in
instances where one wants to know which broad fields of occupations
are indicated. These groupings are of questionable value, as can be seen
from the two that follow:

Group I: artist, psychologist, architect, physician, psychiatrist, osteopath,
dentist, veterinarian.

Group V: YMCA physical director, personnel manager, public administrator,
vocational counselor, YMCA secretary, social science teacher, city
school superintendent, minister.


It is apparent that some of the occupations within each group require 
aptitudes and other personality traits that are quite different from those 
FIG. 24.1. Profile chart (continued)
FIG. 24.2. Sample items from the Strong Vocational Interest Blank.
Stanford University Press, by permission.
of the others. What can be said about Group I, for example, is that, with
the exceptions of artist and architect, they are concerned with biological 
sciences, in varying degrees, and with healing and the amelioration of 
suffering, although this does not apply to all branches of psychology. In 
Group V, the common elements seem to be "working with people" and 
"civic interests." Insofar as groupings assist in defining areas of interest 
and limiting the number and types of occupations to be considered, the 
procedure may be justified; but the educational and vocational counselor 
will only have made a beginning. The Manual (1959) itself, in fact,
recommends that the inventory be scored for all of the occupational scales
rather than for a few.

The Strong inventory may also be scored on a scale for "specialization
level." This scale ". . . may tentatively be interpreted as measuring a
desire or willingness to narrow one's interests to become a specialist 
within an occupational field." Although not restricted to medicine, the 
Manual (1959) states the scale is ". . . based on items which differen- 
tiated medical specialists from physicians in general practice. A high 
score means that a person has responded to the items as specialists do; 
a low score means the reverse." 

The inventory may, in addition, be scored for “nonoccupational” 
interests, found to be useful in guidance: interest maturity; masculinity- 
femininity (based upon percentile scores of males and females on interest 
items); occupational level (differences in interests between "laboring
men," on the one hand, and "business and professional men," on the 
other); studiousness ("factors which contribute to scholastic achievement 
that are not measured by intelligence tests"). 

The procedure followed in determining the weights to be given par- 
ticular responses is, in outline, as follows (54, p. 611). 

1. Blanks are filled out by an adequate sampling of an occupation, 
and by a much larger sampling of “men-in-general.” 

2. Responses to each item are tallied according to the three categories: 
like, indifferent, dislike. 

3. Frequencies of responses to each item, in each of the three categories, 
are calculated in terms of percentages. Table 24.2 presents the percen- 
tages for the first ten items on the blank, as responded to by “men-in- 
general,” and by engineers. 

4. Weights are calculated for each of the frequencies of each of the 
three responses—like, indifferent, dislike—to each item. Table 24.2 shows 
that 21 percent of “men-in-general,” but only 9 percent of engineers, 
would like to be actors. For this item, a calculated weight of 1 is ob- 
tained; and it is given a minus sign because fewer engineers marked 
like for this item. Essentially, the weight given an item depends upon the 
TABLE 24.2

DETERMINATION OF WEIGHTS FOR AN OCCUPATIONAL INTEREST
SCALE: ENGINEERING

                                                                Differences in
First ten items            Percentage of      Percentage       percentage between   Scoring weights
on vocational              "men-in-general"   of engineers     engineers and        for engineering
interest blank             tested             tested           men-in-general       interest
                            L    I    D        L    I    D       L     I     D        L    I    D
Actor (not movie)          21   32   47        9   31   60     -12    -1   +13       -1    0    1
Advertiser                 33   38   29       14   37   49     -19    -1   +20       -2    0    2
Architect                  37   40   23       58   32   10     +21    -8   -13        2   -1   -1
Army officer               22   29   49       31   33   36      +9    +4   -13        1
Artist                     24   40   36       28   39   33      +4    -1    -3        0    0    0
Astronomer                 26   44   30       38   44   18     +12     0   -12        1    0   -1
Athletic director          26   41   33       15   51   34     -11   +10    +1       -1    1    0
Auctioneer                  8   27   65        1   16   83      -7   -11   +18       -1   -1    2
Author of novel            32   38   30       22   44   34     -10    +6    +4       -1    1    0
Author of technical book   31   41   28       59   32    9     +28    -9   -19        3   -1   -2

L = like; I = indifferent; D = dislike.

Source: E. K. Strong (54, p. 75). By permission.


differences in percentages found between the score of an oe
sampling of persons and that of the occupation for which the item weight
is being determined. Table 24.3 illustrates scoring and weights for the
ten items in Table 24.2.
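The tallying and scoring steps outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the percentages and weights are the figures as read from Tables 24.2 and 24.3 for four of the ten items, the rule that converts a percentage difference into a weight is not reproduced here (the weights are simply taken as given), and the names `percentage_differences` and `score_blank` are hypothetical.

```python
# Percentages marking like (L), indifferent (I), dislike (D), as read from Table 24.2.
MEN_IN_GENERAL = {
    "Actor (not movie)": (21, 32, 47),
    "Advertiser": (33, 38, 29),
    "Architect": (37, 40, 23),
    "Author of technical book": (31, 41, 28),
}
ENGINEERS = {
    "Actor (not movie)": (9, 31, 60),
    "Advertiser": (14, 37, 49),
    "Architect": (58, 32, 10),
    "Author of technical book": (59, 32, 9),
}
# Engineering scoring weights for L, I, D, taken as given from Table 24.3.
ENGINEERING_WEIGHTS = {
    "Actor (not movie)": (-1, 0, 1),
    "Advertiser": (-2, 0, 2),
    "Architect": (2, -1, -1),
    "Author of technical book": (3, -1, -2),
}
RESPONSE_INDEX = {"L": 0, "I": 1, "D": 2}


def percentage_differences(item):
    """Engineers' percentage minus men-in-general percentage, for L, I, D."""
    return tuple(e - m for e, m in zip(ENGINEERS[item], MEN_IN_GENERAL[item]))


def score_blank(responses, weights):
    """Sum the weight attached to each response the person actually marked."""
    return sum(weights[item][RESPONSE_INDEX[mark]] for item, mark in responses.items())


# Hypothetical respondent: dislikes the first three occupations, likes the last.
responses = {
    "Actor (not movie)": "D",
    "Advertiser": "D",
    "Architect": "D",
    "Author of technical book": "L",
}
print(percentage_differences("Actor (not movie)"))
print(score_blank(responses, ENGINEERING_WEIGHTS))
```

A full score would sum the weights over all 400 items and then be converted to a standard score and letter rating; that conversion is not sketched here.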


Evaluation of Interest Inventories.²
interests are not tests o

² Besides the Kuder and the Strong, several other inventories are available. They,
however, are not as well conceived or developed as these two. For critical reviews of these and
others, see Buros, Fifth Mental Measurements Yearbook.
TABLE 24.3

SCORES OBTAINED BY AN ENGINEER FOR ENGINEERING INTEREST;
ALSO SCORES FOR INTEREST IN FIVE OTHER OCCUPATIONS, ILLUSTRATING
THE METHOD OF SCORING THE STRONG INTEREST BLANK

                           Scoring        Responses     Scores for
First ten items on         weights for    of an         engineering    Scores obtained by this engineer
the vocational             engineering    engineer to   interest       on interest scales for
interest blank             interest       the ten       obtained by              Life insurance
                                          items         engineer       Lawyer    salesman   Minister   YMCA   Accountant
                            L    I    D    L   I   D
Actor (not movie)          -1    0    1    -   -   x        1            -1        -1         -2        -1       -1
Advertiser                 -2    0    2    -   -   x        2             1        -1          0        -2       -1
Architect                   2   -1   -1    -   -   x       -1             1         1          0         0        0
Army officer                                                0             0         0         -1         0        0
Artist                      0    0    0    -   x   -        0             0        -1          1         1        0
Astronomer                  1    0   -1    x   -   -        1             0         0          2         0        0
Athletic director          -1    1    0    -   x   -        1             0         0          1         0        0
Auctioneer                 -1   -1    2    -   x   -       -1            -1         0          0         0        0
Author of novel            -1    1    0    -   x   -        1             0         0          0         0        0
Author of technical book    3   -1   -2    x   -   -        3             0        -1         -1                  1

Total 10 items                                             +7             0        -3          0
Total 400 items                                          +182           +23      -115        -91      -134      -33
Standard score                                             67            36        10         11         8       16
Rating                                                      A             B         C          C         C        C

Source: E. K. Strong (54, p. 75). By permission.


given group or occupation has a greater chance of finding that type of
activity congenial and, hence, of succeeding in it; provided, of course, 
that he also has the degree of aptitude required. The differentiating value 
of the Strong inventory is illustrated in Figure 24.3, which contrasts 
rather markedly divergent occupational groups in respect to interests. 

Although the Kuder and the Strong inventories differ in their original 
and primary conceptions, they have, in their later additions, provided 
some similarities. The Kuder Preference Record-Vocational is intended 


FIG. 24.3. Distribution of standard scores of five occupations on the artist scale:
100 artists, 100 chemists, 100 certified public accountants, 100 personnel managers,
100 accountants (horizontal axis: standard score). From E. K. Strong (54), by permission.
to identify significant evidences of interests in broad vocational areas. An
occupation or a limited list of occupations is suggested by high scores in
a given area. Or still another restricted list may be suggested by a com- 
bination of the scores in two or more areas. The Strong, on the other 
hand, aims primarily to provide patterns of preferences that distinguish 
one specific occupation from others. The Kuder Preference Record- 
Occupational approximates the Strong in providing scores for each of 
a large number of occupations; and the Strong somewhat approximates 
the Kuder in forming clusters of occupations into separate groupings. 
Each of these instruments, however, continues to be used primarily for 
its original purpose. 

RELIABILITY AND VALIDITY. Reliability coefficients of the ten scores
found with the Kuder vocational inventory, in terms of internal consistency,
are quite satisfactory, varying from .80 to .95, with an average
of about .90. Retest-reliability coefficients, however, after intervals of
one to four years, range from about .50 to .80 for men (the average being 
about .65) and from about .60 to .80 for women (the average being 
about .68). These retest coefficients indicate that, in some instances, 
significant changes have occurred over a period of several years. It ap- 
pears, therefore, that retesting is desirable, from time to time, for coun- 
seling purposes. Long-range decisions, based upon scores obtained in 
grade 9 or 10, will be of questionable validity in more than a few in- 
stances. This finding is not necessarily attributable to defects of the 
instrument; for interests and values undergo change and may strengthen 
during adolescence, as more information and varied experiences are 
acquired. In the case of the Kuder Preference Record-Occupational, the 
internal-consistency coefficients vary from .42 to .82 (with a median of 
.62), while the retest reliabilities, for unspecified intervals, reported for 
a high school and a college group, range from .61 to .85 and .77 to .91, 
respectively.

Reliability coefficients of the Strong inventory for men, using odd 
and even scores, range from .76 to .94, with a median of .88. Retest scores, 
after an interval of only one week, yielded an average coefficient of about
.85. Persistence of interests was estimated by retesting after varying inter- 
vals, with the following correlational results (all males): 


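first tested as eleventh-grade boys; retested after  2 years   .81
first tested as college freshmen;    retested after  1 year    .88
first tested as college freshmen;    retested after 19 years   .72
first tested as college seniors;     retested after  5 years   .84
first tested as college seniors;     retested after 22 years   .75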


These findings indicate remarkably consistent interest trends over 
long periods.* It is not to be assumed, however, that all of the occupa- 


* No comparable data are provided for women. 
tional scales yield equally consistent results (56). For example, the
Manual reports retest results for 663 men, after an interval of 1 y
The coefficients ranged from .48 (public administrator) * to m
and chemist). Here again, changes in scores are not necessarily attributable
to the inventory; they may be the result of changes in the
persons answering the questions. This is suggested by the fact that the highest
retest coefficients were found for men in professions whose interests
remain relatively consistent: engineers, lawyers, psychologists. It is not
surprising that in more than a few instances changes in interests,
over a period of years, are found. Many considerations and determinants
other than ability and original preferences influence occupational choice
which, in turn, have their influence upon subsequent interests and
preferences. These determinants are often subtle, unpredictable, and may
be unknown to the persons concerned. Under these circumstances, the
information provided by these and equivalent inventories is highly
creditable.
For the Strong inventory, a number of different criteria of validity 
were used: mean scores and standard deviations of criterion (specific 
occupational) groups compared with a general sample; correlations with 
grades in schools and colleges; completion of occupational training; 
ratings of success in work; earnings (in sales work); persistence in occu- 
pations; job satisfaction; differences between occupational groups; and 
correlations with other types of psychological tests. Excepting the correlations
found with tests of intelligence, educational achievement, personality
traits, and with school and college grades, these criteria, all already
familiar to the student, have been found to be related significantly to
the scores on the Strong inventories.


Three important types of data will indicate the general nature of the
findings. When mean scores of each occupational group (male) were compared
with mean scores of "men-in-general," the range of percentages
of overlapping of scores was from 53 to 15, with an average of 31.5. For
women the range was from 43 to 17, with an average of 35.5

Of 137 college students who became physicians, 64 percent, as undergraduates,
had ratings of A on the physician scale; 15 percent rated B+;
10 percent B, 8 percent B-, and 5 percent C. The probabilities of subsequently
entering the occupation, indicated by inventory ratings obtained
for 663 college students as shown by their occupations 18 years
after testing, were found to be as follows. For students who scored A+,

* The number of cases scored for this occupation was 248.

The percent overlapping indicates the percentage of one group who reach or exceed
the mean of the other. Thus, 15-percent overlap means that only 15 percent of "men-in-general"
reached or exceeded the mean score of the particular occupational group being
used in this comparison.
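The percent-overlap index defined in the note above can be illustrated with a short Python sketch. The scores below are invented for illustration only, and the function name `percent_overlap` is hypothetical; the computation simply counts how many members of one group reach or exceed the mean of the other.

```python
from statistics import mean


def percent_overlap(group_scores, reference_scores):
    """Percentage of group_scores that reach or exceed the mean of reference_scores."""
    cutoff = mean(reference_scores)
    reached = sum(1 for score in group_scores if score >= cutoff)
    return 100.0 * reached / len(group_scores)


# Invented scores on one occupational scale (not data from the Manual).
men_in_general = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
occupational_group = [50, 55, 60, 65, 70]

# Two of the ten men-in-general reach the occupational mean of 60.
print(percent_overlap(men_in_general, occupational_group))
```

The smaller the percentage, the more sharply the scale separates the occupational group from the general sample.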
the probabilities were 88 in 100; for those scoring A-, 74; B+, 62; B,
49; B-, 36; C, 17.

The foregoing validity data indicate that the Strong inventory has 
considerable value in vocational guidance with college students. It is not 
to be assumed, however, that this instrument is equally efficient, as a 
predictor, at earlier ages. As Strong states in the Manual, the Vocational 
Interest Blank is “distinctly” applicable to ages 25-55, since occupational 
interests, it is maintained, change very little during that period. He also 
states that changes are "relatively slight" between ages 20 and 25. But, 
he adds, the VIB should be used “. . . below 17 years of age only with 
relatively mature boys and girls of 15 and 16 years" (54). Regardless, 
however, of the age of the person being advised, the results of the Strong 
inventory should be interpreted and utilized only by counselors who are 
familiar with its rationale and with the techniques used in its construc- 
tion. 

Validity of the Kuder Preference Record-Vocational was estimated 
in several ways. Profiles of a large number of specific occupations were 
derived in terms of percentile ranks in the ten areas. Each profile should 
show significant peaks for those interests regarded as essential and as 
having differentiating value for each occupation. Kuder found that “mean 
profiles" for occupational groups ". . . indicate in general that the 
names assigned to the various scales are appropriate in terms of the type 
of occupation entered as well as in terms of the activities for which the 
scale is scored. Chemists are found to be high on the scientific scale,
writers on the literary scale . . ." etc.

Scores in each of the ten areas were analyzed in relation to choice of
curricula and of occupations; and scores were related to degree of job
satisfaction. These criteria yielded reasonably satisfactory results. Scores 
for the areas were correlated with educational achievement; but the 
coefficients were, for the most part, in the .20s and .30s, though higher
in a few instances. Most significant findings were those showing the rela- 
tive independence of the ten areas of preferences; that is, their inter- 
correlations. Of forty-five coefficients, thirty are negative, ranging from 
—.52 (outdoor correlated with persuasive) to —.08 (musical correlated 
with clerical). Of the fifteen positive correlations, the range was from .54
(commercial correlated with clerical) to .03 (artistic correlated with literary)
(25).

In validating the occupational inventory, a "differentiation ratio" 
was used. First the range of scores was divided, arbitrarily, into five parts. 
The ratio of the two proportions (for the norm group and for the occupational
group) falling within an indicated score range, is the differentiation
ratio. This index is used to answer this question: Are the scores of the
occupational group significantly higher than those of the norm group?
That is, does the inventory differentiate between them? Table 24.4 illustrates
the procedure in a case where the scores of clinical psychologists
were compared with those of the norm group. It appears that the inventory
differentiates well for that profession (26).


TABLE 24.4

DIFFERENTIATION RATIO: KUDER PREFERENCE RECORD-OCCUPATIONAL
SCORED FOR CLINICAL PSYCHOLOGIST

Range of        Number in sample        Number in sample          Differentiation
scores          of 200 psychologists    of 200 from norm group    ratio*

56 or higher          153                       4                 +10†
51-55                  23                       8                 +3
46-50                  11                      15                 (no difference)
41-45                  11                      27                 -2
40 or less              2                     146                 -10†

* In each instance, the larger number is the numerator. If the number in the
occupational group is larger, the sign is plus.
† The ratio is more than 10 on the positive or negative side.
Source: Science Research Associates. Reproduced by permission.
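The arithmetic behind the differentiation ratio can be sketched in Python from the counts in Table 24.4. Two points are assumptions inferred from the table rather than stated rules: that ratios are rounded to the nearest whole number, and that a rounded ratio of 1 is reported as no difference. The function names are hypothetical.

```python
def differentiation_ratio(n_occupation, n_norm):
    """Larger count over smaller count; plus if the occupational group is larger."""
    if n_occupation <= 0 or n_norm <= 0:
        raise ValueError("both counts must be positive")
    ratio = max(n_occupation, n_norm) / min(n_occupation, n_norm)
    return ratio if n_occupation >= n_norm else -ratio


def table_entry(n_occupation, n_norm):
    """Format a ratio the way the table appears to report it (assumed rounding)."""
    ratio = differentiation_ratio(n_occupation, n_norm)
    sign = "+" if ratio > 0 else "-"
    if abs(ratio) > 10:
        return f"more than {sign}10"
    rounded = round(abs(ratio))
    if rounded <= 1:
        return "no difference"
    return f"{sign}{rounded}"


# Counts per score range: (psychologists, norm group), as read from Table 24.4.
counts = {
    "56 or higher": (153, 4),
    "51-55": (23, 8),
    "46-50": (11, 15),
    "41-45": (11, 27),
    "40 or less": (2, 146),
}
for score_range, (n_occ, n_norm) in counts.items():
    print(score_range, table_entry(n_occ, n_norm))
```

Because both samples contain 200 cases, the ratio of counts equals the ratio of proportions; with unequal samples the counts would first be converted to proportions.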


Both inventories were also validated by means of item analysis. Essentially,
the problem would be to examine an adequate number of re-
groups, to determine which of them are

APPLICABILITY TO DIFFERENT AGE GROUPS. With whom should such
occupational and interest inventories be used? Obviously, they can be
valid only for persons whose lives have been long enough and varied
enough to have included experiences providing a basis for a choice be- 
tween the alternatives presented by each item in the inventories. The 
Kuder Preference Record, concerned with broad areas of interest, is 
standardized for high-school students (beginning with grade 9), college 
students, and adults at large. The Strong Vocational Interest Blank, con- 
cerned primarily with specific occupations, is intended for ages 17 and 
over. Since the Strong inventory is based upon responses of adult men 
and women, more valid and useful results will be obtained with adults 
than with persons in their teens. Since the Kuder record has been stand- 
ardized with high-school and college students, as well as with adults, it 
may be used appropriately with adolescents. But even so, the interests, 
values, and attitudes of adolescents are still in a state of flux and are as 
yet not fully developed; hence, the results of the preference record, when 
used for guidance purposes, must be interpreted with this fact in mind. 

With either a high-school or a college student, the scores and the pro- 
file obtained with the sounder inventories in this category are useful as 
an introduction to the study of occupations that involve activities of the 
sort for which he has indicated a preference, and during interview and 
counseling to check the individual's choice of an occupation against his 
expressed interests and preferences. For purposes of guidance at the 
secondary-school and college levels, it appears that the Kuder has the 
greater value because it is less specific. 

Both instruments are intended to provide measures of motivation in 
various fields of study and work. Their application and usefulness are 
based upon the premise that release and effective utilization of an in- 
dividual's general ability and specific aptitudes are strongly affected by 
motivation and interests, that a person will work best at what he enjoys 
most. This is a psychologically valid position. However, it is doubtful 
that adolescents’ interests and preferences are actually classifiable into 
the highly specific occupations. It appears, rather, that they fall into broad 
categories, each including a group of educational and vocational interests. 
The method used in the Kuder record, therefore, directed toward iden- 
tification of general patterns (or profiles) of preferences and interests, 
seems the more appropriate in the guidance of individuals who have not 
yet reached adulthood. In any event, anyone who uses these and similar 
instruments must realize that the profiles do not present simple patterns 
of strong likes and dislikes. They provide a valuable additional source 
of information that can be added to results obtained by means of other
psychological instruments, educational records, and psychological interviews.

Some critics of preference inventories, and some persons who fill them
out, have said that these devices reveal what the subjects already knew
about themselves. Although this is true in some instances, the inventories
are still useful, for they provide the means of an organized and stand- 
ardized process of stock-taking and comparison of an individual's scores 
with norms and percentile scores of known groups. 


Attitudes and Values⁶


DEFINITION OF ATTITUDE. An attitude is a dispositional readiness to 
respond to certain situations, persons, or objects in a consistent manner 
which has been learned and has become one's typical mode of response.
An attitude has a well-defined object of reference. For example, one's 
views regarding a class of food or drink (such as fish and liquors), sports, 
mathematics, or Democrats, are attitudes. If, however, a person's characteristic
behavior is described as self-sacrificing, intellectual, liberal (or
the opposite of these), some of his traits are being indicated, since these
terms represent his


The degree or stre 
positive through a 
sible to construct tests of innumerable attitudes. 


attitudes are based upon several 
deal with a controversial question; 
ights in regard to the question will 
us statements that are made pro and 
scaled regarding the degree to which 
question under consideration. 


ents are edited. Then they are classified 
an eleven-point scale. This is done by 
eleven piles, presumably forming a con- 
avorableness or unfavorableness of each 


⁶ Since attitudes and values are not separable in actual life situations, they are dealt
with under a single heading.
apparently favorable, then that item is considered irrelevant and is discarded.
Statements having approximately the same values in the scale
should show high consistency in degree of endorsement by each subject. 
This is essentially a simple method of item analysis. Ambiguity of an
item is determined by the spread or range of judges' ratings in the orig- 
inal eleven-fold scale, given in terms of Q (quartile deviation). If an item's 
Q is "high," it is eliminated. 

In taking an attitude test scaled in this manner, the respondent checks 
those statements with which he agrees, his score being the median of 
the scale values of the items he has marked. Thurstone held that scales 
constructed for different attitudes by this method permit direct com- 
parison of the scores of any attitudes so measured. The validity of such 
comparison, however, has been questioned because the defined "neutral 
points" of different attitudes are not necessarily the same. Nor are the 
intervals demonstrably equal; they are only equal appearing. The Thur- 
stone method is useful if strict comparability of scores is not assumed. 

Thurstone and his students developed a series of scales, each consisting 
of statements from extremely favorable to extremely unfavorable. The 
topics included in these scales dealt with attitudes—among others— 
toward Negroes, Chinese, war, censorship, the Bible, patriotism, and 
freedom of speech. The following statements are from the scale of atti- 
tudes toward the church. The scale value of each is given in parentheses, 
low values being favorable and high values unfavorable, with a possible 


range from 0 to 11.

I find the services of the church both restful and inspiring. (2.3)
I think the church is a parasite on society. (11)
I believe what the church teaches but with mental reservations. (4.5)
I think the teaching of the church is altogether too superficial to have much
social significance. (8.3)
I believe in religion but I seldom go to church. (5.4)
I believe the church is the greatest institution in America today. (1.7)
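The scoring rule described earlier (the respondent's score is the median scale value of the statements he endorses) and the ambiguity index Q can be sketched in Python. The scale values are those quoted above, reading the first as 2.3. The judges' ratings are invented for illustration, the quartile convention (`method="inclusive"`) is an assumption because the text does not specify one, and the function names are hypothetical.

```python
from statistics import median, quantiles


def thurstone_score(endorsed_scale_values):
    """Respondent's score: median scale value of the statements checked."""
    return median(endorsed_scale_values)


def quartile_deviation(judge_ratings):
    """Q, half the distance between the first and third quartiles of judges' placements."""
    q1, _, q3 = quantiles(judge_ratings, n=4, method="inclusive")
    return (q3 - q1) / 2


# A respondent who checks the three most favorable statements quoted above.
endorsed = [2.3, 4.5, 1.7]
print(thurstone_score(endorsed))

# Invented placements of one statement by nine judges on the eleven-point scale.
placements = [2, 3, 3, 4, 4, 4, 5, 5, 6]
print(quartile_deviation(placements))
```

A statement whose Q is large relative to the other statements would be a candidate for elimination as ambiguous; the cutoff itself is a matter of judgment and is not given in the text.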

Likert suggested the use of an attitude-scoring technique that is simpler
than the Thurstone method and is regarded by many as at least as reliable.
Each item, or statement, in the attitude scale is followed by five
responses, one of which is checked by the subject. The responses, indicating
degree of strength of attitude, are: strongly agree (SA), agree
(A), undecided (U), disagree (D), or strongly disagree (SD). (Approve-disapprove
may be used in place of agree-disagree.) Arbitrary scoring
weights of 1, 2, 3, 4, 5 were assigned for the respective responses. An individual's
score on a particular attitude scale is the sum of his ratings on
all items. The principal advantage of Likert's method, obviously, is that
it makes unnecessary the use of a group of judges to arrange statements
into categories representing degrees of favorableness or unfavorableness.
However, since the items are selected on an a priori basis, and since the 
scoring weights are arbitrarily assigned, the use of the Likert method, 
like the Thurstone, measures attitudes only in the sense that individuals 
are given a rank order according to attitude intensity. 
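A minimal Python sketch of Likert's summated scoring follows, assuming the weights 1 through 5 are attached to SA, A, U, D, SD in the order the responses are listed. The responses are invented, the name `likert_score` is hypothetical, and any reversal of weights for unfavorably worded statements is left out because the text does not describe it.

```python
# Assumed weights, following the order in which the responses are listed.
LIKERT_WEIGHTS = {"SA": 1, "A": 2, "U": 3, "D": 4, "SD": 5}


def likert_score(responses):
    """Sum of the weights of the responses checked, one response per item."""
    return sum(LIKERT_WEIGHTS[response] for response in responses)


# Invented responses of two persons to the same four statements.
person_a = ["SA", "A", "U", "D"]
person_b = ["D", "SD", "D", "U"]

print(likert_score(person_a))
print(likert_score(person_b))

# The totals order the two persons by attitude intensity, nothing more.
ranked = sorted([("a", likert_score(person_a)), ("b", likert_score(person_b))],
                key=lambda pair: pair[1])
print(ranked)
```

As the passage notes, such totals support only a rank ordering of individuals; the distance between two totals has no fixed meaning.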

The following items from the Minnesota Personality Scale (for men) 
are examples of the technique suggested by Likert. 


SA, A, U, D, SD   On the whole lawyers are honest.
SA, A, U, D, SD   The future looks very black.
SA, A, U, D, SD   Education only makes a person discontented.


Remmers and his collaborators have prepared a series of attitude scales 
that differ in construction from Thurstone's in that each scale is intended
to measure an attitude toward a larger group of objects, persons, or in- 
stitutions. The scales deal, among others, with national and racial groups, 
vocations, teachers, social action, and school subjects. 

THE SEMANTIC DIFFERENTIAL? 


Father
happy        ___:___:___:___:___:___:___  sad
hard         ___:___:___:___:___:___:___  soft
intelligent  ___:___:___:___:___:___:___

Woman
fair         ___:___:___:___:___:___:___  unfair
active       ___:___:___:___:___:___:___  passive
superior     ___:___:___:___:___:___:___  inferior

Or

Father:        happy       ___:___:___:___:___:___:___  sad
Woman:         fair        ___:___:___:___:___:___:___  unfair
Success:       merit       ___:___:___:___:___:___:___  chance
FDR:           wise        ___:___:___:___:___:___:___  foolish
Labor Unions:  idealistic  ___:___:___:___:___:___:___  selfish
This method has possibilities of providing information on an individual's
traits; and it may add information regarding specific sources of personality
difficulties or behavioral conflict. This is so if the stimulus words
and the word opposites are skillfully chosen so as to elicit significant
responses in important areas. The stimulus words used and the polar 
words selected for each pair will depend upon their intended purpose 
(see, for example, 38, p. 43). 

Osgood and his colleagues have found, through factor analysis, that 
the numerous word-opposites they used in their experiments might be 
classified into three variables (factors). 


Evaluative: good-bad, beautiful-ugly, clean-dirty, fair-unfair, fragrant-foul. 

Potency: large-small, strong-weak, thick-thin, loud-soft, deep-shallow. 

Activity: fast-slow, active-passive, sharp-dull, angular-rounded. 


In the case of any individual's responses, of course, these three variables 
can occur in different combinations and at different degrees of strength. 
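
One way the three factors might be summarized for a single rated concept is sketched below. This is an illustration under stated assumptions: the seven-step coding (7 at the first-named pole, 1 at the second), the simple averaging, the function name `factor_scores`, and the sample ratings are all hypothetical and are not taken from Osgood's published procedure.

```python
from statistics import mean

# Scale pairs grouped under the three factors named in the text.
FACTORS = {
    "evaluative": ["good-bad", "beautiful-ugly", "clean-dirty",
                   "fair-unfair", "fragrant-foul"],
    "potency": ["large-small", "strong-weak", "thick-thin",
                "loud-soft", "deep-shallow"],
    "activity": ["fast-slow", "active-passive", "sharp-dull",
                 "angular-rounded"],
}

def factor_scores(ratings):
    """Average one concept's ratings within each factor.

    ratings maps a scale name to a position from 1 to 7
    (assumed coding: 7 = first-named pole, 1 = second-named pole).
    Scales that were not rated are skipped.
    """
    scores = {}
    for factor, scales in FACTORS.items():
        rated = [ratings[s] for s in scales if s in ratings]
        if rated:
            scores[factor] = mean(rated)
    return scores

# Hypothetical ratings of the concept "Father" by one respondent.
father = {"good-bad": 6, "fair-unfair": 7, "large-small": 3,
          "strong-weak": 5, "active-passive": 4}
print(factor_scores(father))
```

Comparing such profiles across concepts (for example, "Father" and "Woman") shows the differing combinations and strengths mentioned above.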

This technique was developed originally for experimental purposes, 
not for clinical use; but it has been found to have possibilities also in 
the latter field. While the three factors named above can be useful in 
describing and understanding some clinical cases, they are not exhaustive. 
It is highly probable that clinical use of and experiment with this 
technique will provide other variables; and the likelihood is that these 
will be similar to or identical with those found by other methods.8 

8 As an illustration of possible clinical use, see (38, pp. 258 ff.) and (39). 

SOCIAL DESIRABILITY. In connection with rating scales (Chapter 21), 
the forced-choice type of item was explained. It was pointed out that 
respondents often select descriptive statements because they are socially 
desirable. It has been demonstrated that when the usual type of item 
(a single statement) is used, there is a strong tendency to describe 
oneself in socially desirable terms. Edwards (12) reports correlations in 
the .80s between the frequency with which descriptive items are selected 
and their estimated social desirability. This result has been called the 
“façade effect,” since it is attributed to the desire (not necessarily 
conscious) of most persons to make a favorable social impression. The 
façade effect, however, is only a partial explanation that does not apply 
to all persons. Another explanation is the fact that the behavior of most 
individuals in a given society actually conforms, in greater or lesser 
degree, to the cultural stereotypes. Their behavior is, thus, not a façade. 

Edwards developed a scale to evaluate the effects of the 
social-desirability factor upon respondents’ attitudes toward traits and 
behaviors listed in personality inventories. He submitted 150 items from 
the MMPI to 
ten judges, who were asked to give the socially desirable responses to 
each. From the viewpoint of a favorable social attitude toward a 
respondent, how should each item be answered? The judges agreed perfectly 
on 79 of the 150. On the basis of later statistical analysis, 39 items 
were selected. These, after experimental tryout, showed the greatest 
differentiation between a high-scoring and a low-scoring group on the 
79-item scale. The score an individual earns on this scale is regarded as 
a measure of his tendency to give socially desirable responses in 
self-description. The criticism of the concept of façade effect, in the 
preceding paragraph, should be borne in mind. 
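
The two selection steps described above (retaining items on which the judges agree perfectly, then retaining items that best separate high and low scorers) can be sketched as follows. This is a sketch under assumptions: the text does not state which statistic Edwards used, so the difference between two proportions is used here purely for illustration, and the function names and data are hypothetical.

```python
def unanimous_items(judgments):
    """Return the items on which every judge gave the same response.

    judgments maps an item label to the list of judges' responses
    (for example "T" or "F" for the socially desirable answer).
    """
    return [item for item, votes in judgments.items()
            if len(set(votes)) == 1]

def discrimination(item, keyed_response, high_group, low_group):
    """Difference between the proportions of the high-scoring and
    low-scoring groups who give the keyed response to an item.
    This index is an assumed stand-in for the unnamed analysis."""
    def proportion(group):
        hits = sum(1 for person in group if person[item] == keyed_response)
        return hits / len(group)
    return proportion(high_group) - proportion(low_group)

# Hypothetical judgments from three judges on three items.
judgments = {"item1": ["T", "T", "T"],
             "item2": ["T", "F", "T"],
             "item3": ["F", "F", "F"]}
print(unanimous_items(judgments))

# Hypothetical answers of high and low scorers to item1.
high = [{"item1": "T"}, {"item1": "T"}]
low = [{"item1": "T"}, {"item1": "F"}]
print(discrimination("item1", "T", high, low))
```

Items with the largest differences would be the candidates for the final scale.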

Edwards combined the forced-choice type of item with the 
social-desirability concept to devise the Personal Preference Scale, in 
which [...] resulted in a psychological [...] could be [...] the influence 
[...] statements therein. Two statements representing different 
personality traits constitute an item. The two statements in each item are 
equal (or nearly so) in regard to [...] apparently unrelated forms of 
behavior. “If one is now asked to choose that statement in the pair that 
is more characteristic of himself, it may be argued that the factor of 
social desirability will be of much [...] response than in the case of a 
Yes-No type of item [...]” The author's statistical analysis of this 
inventory [...] 
needs listed by H. A. Murray et al. (34) and which are tested by means 
of the Murray Thematic Apperception Test (see Chapter 25). The se- 
lection of these needs as the basis of the scale gives assurance that it is 
devised to evaluate aspects of personality which have been derived from 
extensive research. 

TESTS OF VALUES. A test of values, in contrast to one of attitudes, 
purports to measure generalized and dominant interests. The Study of Values 
(Allport et al.), for example, is based upon six categories of values, as 
classified by Spranger (52). The items are intended to measure the 
relative prominence of the subject's interests, for the purpose of 
classifying his values. The six categories are: theoretical, economic, 
esthetic, social, political, and religious. According to this 
classification, the dominant interest of the theoretical man is discovery 
of truth; the economic man is interested in what is useful; the esthetic 
man values form and harmony most; the highest value of the social type is 
love of people; the political man is interested primarily in power; and 
the religious man places the highest value on unity, in an effort to 
comprehend the cosmos as a whole. This test of values presents forty-five 
problem situations, under each of which the subject is required to choose 
from multiple choices: responses which are indicative of degrees of the 
six types of values. For example: 


The main object of scientific research should be the discovery of truth 
rather than its practical applications. (a) Yes; (b) No. 

Do you think that a good government should aim chiefly at (the following 
statements are to be ranked in order of preference): 

(a) more aid for the poor, sick, and old 

(b) the development of manufacturing and trade 

(c) introducing highest ethical principles into its policies and diplomacy 

(d) establishing a position of prestige and respect among nations 


It is not to be assumed that these six are "natural" types, or that they 
include all possible value groups, or that individuals can be classified 
entirely under one or another. As a matter of fact, most persons are a 
mixture of two or more of these value groups, some values being stronger 
and more dominant than others in each person. The six classifications 
were employed as starting points for the investigation of complex views 
of life which, among others, serve to give unity and purposefulness to the 
mature person. 

A different approach to the study of values of high-school and college 
students and adults is found in the Sims SCI Occupational Rating Scale. 
This scale is devised "to reveal the level in our social structure, social 
class, with which a person [...]." The scale lists forty-two occupations, 
representing all levels of socioeconomic status. The subject indicates 
whether persons following each of the occupations generally belong to the 
same, or to a higher, or to a lower social class than he does himself. 
Sims states that “. . . by examining the occupations which the subject 
indicates are those whose followers belong to his own social class, we are 
able to determine the position which he assigns himself in our society.” 
This inventory may be regarded as one that estimates one set of social 
values; for affiliation with a socioeconomic group usually signifies 
acceptance of the major values of that group. 

CHECKLISTS. In Chapter 21, the use of checklists for rating persons 
other than one's self was explained. Also available are several checklists 
for self-rating. Although these are intended to assist the individual 
himself and his counselor in more readily identifying sources of 
behavioral and adjustment difficulties, they also serve to indicate the 
subject’s values and his attitudes toward persons, institutions, and other 
aspects of his environment. The Mooney Problem Check List is 
representative of this type of instrument. Forms are available for junior 
high school through college, and for adults. Among the areas sampled are 
the following:11 health, school, home and family, boy-girl relations, 
self-centered concerns, morals and religion, finances, economic security, 
and courtship. The checked items are not scored, but they serve as a basis 
for counseling or, in some areas, for class discussion. 

The areas included in the checklist were selected through an analysis 
of written statements of problems obtained from several thousand high 
school students, as well as from counseling and clinical sources. The 
Mooney list is thus an example of content validity.12 

11 The areas sampled vary in the several levels: junior high school, seven; 
senior high school, eleven; college, eleven; adults, nine. 

BIOGRAPHICAL DATA QUESTIONNAIRES. With this type of device, the 
purpose is to sample several areas of a person’s experiences that appear 
to be associated, directly or indirectly, with behavior and success in a 
particular occupation or situation. These aspects may include the 
intellectual as well as the nonintellectual; for example, education, 
religious activities, occupations, interest in athletics, marital record, 
financial status and [...] attitudes, etc., are indicators of what may be 
expected in the future. Actually, biographical data questionnaires are 
variations on the familiar application blank, including a wide range of 
information, with answers to be given in multiple-choice form. 

During World War II, this technique was employed in the armed 
forces in an effort to improve the selection of personnel for various 
training assignments. The results were not highly successful; the 
correlation between questionnaire rating and degree of success as a pilot 
was .30. These results, however, were regarded as encouraging enough to 
warrant continued research with biographical data questionnaires (20, 57) 
in civilian situations. 

The following is an item from the questionnaire used in the U.S. Air 
Force during World War II (19, p. 772). 


[...] any of the following types of work which you have done at any time 
and for which you have received remuneration. (More than one may be 
marked.) 

A. Manufacturing industries (machine operator, factory hand, textile [...]) 

B. Technical trades (baker, electrician, radio repairman, etc.) 

C. Transportation and communication (truck driver, linesman, deckhand, 
etc.) 

D. Business trades (store clerk, salesman, agent, window dresser, etc.) 

E. Public service (fireman, policeman, forest ranger, soldier, etc.) 


A recent attempt to devise a biographical questionnaire for selection of 
educational administrators sampled information in the following areas: 
childhood and early background; professional preparation; health; 
interests; early signs of leadership; heterogeneous items (20). Two items 
follow. 


During most of the time before you were 16 you lived: 
A. with both parents 
B. with one parent 
C. with a relative 
D. with foster parents or nonrelatives 
E. in a home or institution 


How often were you a leader of your childhood gang activities up to the 
[...] 
A. always 
B. frequently 
C. occasionally 
D. seldom or never 
E. never a member of a gang; or can't remember 

Although this investigation proved of little value for the selection of 
school administrators, it is reported here as representative of the approach 
to the problem and of the early stages of a technique that may prove 
fruitful for purely practical purposes. 

Since the introduction of the biographical questionnaire in personality 
study, efforts have been made to organize the data into patterns and 
clusters, and to assign trait names. Multiple-choice items are written to 
sample biographical experiences that will represent these patterns and 
clusters, from which predictions of subsequent performance might be 
made. Data obtained with these items are then validated against external 
criteria to estimate the predictive value of the questionnaire. 

The biographical-data method has within it a potentially serious defect. 
It is a purely empirical procedure, in which an attempt is made to find 
the predictive value of each biographical datum with respect to “success” 
in a specific position or in a type of occupation. Since this type of 
questionnaire does not deal with abilities, skills, and basic personality 
traits, and since it is validated against job-performance ratings and 
against "progress" of individuals in an organization, the predictive 
significance of biographical data is largely a matter of the standards, 
values, and preferences of raters or of the particular organization. For 
example, it may be found that being "rural-born" (rather than "city-born") 
has a fairly heavy positive weight in predicting progress to a school 
superintendency, or to an executive post in industry. This asset would 
indicate nothing about the person’s abilities for the position, about his 
personality traits, or about the actual demands made by the position 
itself. Such a finding would, however, assist in an analysis of the 
attitudes and values of prospective employers. This criticism will not 
apply, of course, to all biographical information obtained through the 
questionnaire. 

Evaluation of Tests of Attitudes and Values. There is little to 
choose between the methods of attitude testing, so far as reliability is 
[...] devised, the median reliability coefficient [...] below .60, and 
even below .50, while others [...] the coefficients were so low as to be 
[...] scoring (Thurstone's and Likert's) are [...] (about .90), as shown 
when the same [...] 

For example, consider two persons, one of whom is extremely hostile to 
churches while the other is indifferent to them. Their scores on the scale 
will differ significantly; but the indifferent person might attend and sup- 
port a church as little as the hostile one. On the other hand, another in- 
different person might attend regularly for social or economic reasons. 
Similarity of manifest behavior may be demonstrated, also, by one who is 
hostile to the foreign-born and by one who merely avoids them. The fact 
that attitudes and overt behavior need not correspond makes validation, 
in the usual terms, a near impossibility. It is reasonable to conclude, 
however, that if individuals make a genuine effort to respond accord- 
ing to their own attitudes, these scales are useful in evaluating the beliefs 
of the respondents, as of the time the responses are given. 

As in all other forms of psychological testing, we are interested in 
knowing whether attitudes and values change over a period of years. 
Kelly (23) had 300 engaged couples fill out questionnaires during the 
years 1935-1938 and retested a very large percentage of them in 1954. 
Scores on the two sets of tests were correlated, with the following results. 


Allport-Vernon: religious, .60; theoretical, economic, and esthetic, in the 
.50s; political, in the high .40s; social, in the low .30s. 

Remmers Generalized Attitude Scale: gardening, housekeeping, entertaining, 
church, in the .30s; rearing children, about .15; marriage, about .07. 


When one takes into account the fact that the Allport-Vernon retest 
correlations after one year (reliability) were .75 or lower, the coefficients 
obtained after 20 years indicate an impressive degree of stability. The 
Remmers retest correlations, for Forms A and B, were between .70 and .80. 
The much higher long-term stability shown by the Allport-Vernon in- 
dicates that it is sampling more fundamental aspects of personality than is 
the Remmers. Furthermore, with the exception of attitude toward 
"church"—which is not necessarily the same as a "religious" attitude— 
the Remmers scales measure attitudes that are based upon limited ex- 
perience, or none at all, or are susceptible to chance experiences of a 
fortunate or an unfortunate kind. In general, whether retest scores will 
or will not change significantly after a long interval will depend upon 
each respondent's experiences, including education, in the interim. 
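
The retest coefficients reported above are product-moment correlations between first and second scores. A minimal sketch of the computation follows; the function name `pearson_r` and the scores are hypothetical illustrations and are not Kelly's data.

```python
from math import sqrt

def pearson_r(first, second):
    """Product-moment correlation between two equal-length score lists."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    sxx = sum((x - mean_x) ** 2 for x in first)
    syy = sum((y - mean_y) ** 2 for y in second)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores of five persons on a first test and on a retest.
first_test = [30, 42, 25, 38, 45]
retest = [33, 40, 29, 35, 41]
print(round(pearson_r(first_test, retest), 2))
```

A coefficient near 1.00 means that persons kept nearly the same relative standing on the two occasions; a coefficient near zero means that the first standing tells little about the second.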


Opinion Polling 


Opinion polling is essentially a method of finding out the atti- 
tudes and values of a specified population. It has become a specialized 
field of study and practice. Opinion polling has been concerned also 
with a great variety of subjects dealing with social, economic, 
international, military, and other questions, and also with questions of 
consumer preferences, usually called "market research" and now called, by 
some, "motivational research." 

Although some opinion studies use several questions in surveying 
an issue, many employ only a single question. The question may be 
given in one of several forms: some require merely a yes or no answer; 
some require a rating of intensity or degree, such as strongly approve, 
approve, etc., or very much, much, etc.; at times the respondent is asked 
to check or rank items in a given list; sometimes the respondent selects 
one of two alternatives; occasionally the question is of the "open-end" 
type, in which the respondent completes a statement or sentence to suit 
himself. 

The mailed questionnaire, which had been in use long before opinion 
polling became popular, is another form of opinion gauging. This form 
of questioning, however, presents several serious disadvantages, so that 
it is not as widely used as formerly. Representative [...] difficult to 
obtain or develop; [...] is often small and atypical; [...] understood or 
correctly interpreted [...] intelligence scale; and the semantic problem 
is always present. 

Opinions are obtained [...] the population to be polled has been defined, 
it may be sampled by one of three most common methods: (1) random 
sampling; (2) stratified [...] characteristics of the population in the 
[...] selection, individuals of [...], even if repeated visits to the 
homes are necessary. 

The area-sampling method is infrequently used because it is so 
expensive; it requires that relevant information be obtained regarding 
every family within specified areas and that extremely elaborate files be 
kept. It is doubtful whether the advantages of the area-sampling method 
warrant the cost and the attempts to obtain information of kinds that 
many people will, with justification, regard as an invasion of their 
privacy. 

The major problems and difficulties in opinion polling will be in- 
dicated without elaboration. 


With any method of sampling, it is extremely difficult to get a completely 
unbiased sampling because of unknown chance errors, unknown selective 
factors, and errors of judgment in evaluating traits and responses of some 
persons. 

It is doubtful that opinions can be correctly gauged by a single question, 
except in special instances such as elections, when the respondent makes a 
choice between candidates. 

Answers to questions may be deliberately falsified or there may be lack 
of frankness which results in a large number of “undecided” or neutral 
answers. 

All respondents do not necessarily interpret a question or statement in 
the same way; the same question or statement has varying connotations for 
different persons. 

Some respondents do not understand the language or phrasing of the 
question or statement. 

Some respondents do not know or understand the issue being dealt with. 

Individuals who do not actually have an opinion feel at times constrained 
to express one anyhow. 

Over-all percentages of responses vary with different ways of stating a 
question. 

When the open-end question is used, it is difficult to classify the responses. 

Responses are influenced by the training, influence, and possible bias of 
the interviewer. 

Significant differences in status between interviewer and respondents may 
influence answers: for example, responses of Negroes to Negro may differ 
from those of Negroes to white. 

Verbal expression of an opinion does not necessarily indicate the re- 
spondent's actions. 

Different persons have different reasons for giving the same responses; 
and persons giving different responses may do so for a common reason. 
Psychologically, it is necessary to determine, through skillful interview, 
the reasons for a response. 

Some of these difficulties and problems can be met in part if the 
questions are stated unambiguously and simply and are easily understood by 
persons for whom intended; if the respondents are familiar with the issue; 
if the respondents could reply by "secret ballot" or could be sure of 
remaining anonymous; if more than one question is used for a given issue 
[...] 

As psychologists, however, we are primarily concerned with learning 
why individuals hold certain opinions about the issues that are polled, 
rather than with learning only that certain percentages of the sampling 
think thus or so. Why do certain persons choose to vote as they do? Why 
do a large majority of a certain category of housewives prefer 
pastel-colored refrigerators to white? Why are some individuals hostile to 
persons of Oriental origin? Determination of reasons for these and even more 
subtle behaviors, attitudes, and values requires interviewing by qualified 
psychologists or other qualified professional persons. So long as opinion 
polling continues to be a matter of classification of responses and deter- 
mination of percentages, it remains, aside from technical considerations, 
a statistical problem of significance principally to political scientists, 
sociologists, and consumer-research specialists. 

25. 


PROJECTIVE METHODS: 
THE RORSCHACH AND THE 
THEMATIC APPERCEPTION TESTS 


Definition and Explanation 


Psychologically, projection is an unconscious process whereby an 
individual (1) attributes certain thoughts, attitudes, emotions, or char- 
acteristics to other persons, or certain characteristics to objects in his 
environment; (2) attributes his own needs to others in his environment; 
or (3) draws incorrect inferences from an experience. Projection is not 
recognized as being of personal origin, with the result that the content 
of the process is experienced as an outer perception and of external 
origin. 

A projective test, then, is one that provides the subject with a stimulus 
situation, giving him an opportunity to impose upon it his own private 
needs and his particular perceptions and interpretations. The several 
forms of the projective method (pictures, inkblots, incomplete sentences, 
word associations, one’s own writings and drawings, and others) are in- 
tended to elicit responses that will reveal the individual's “personality 
structure,” feelings, values, motives, characteristic modes of adjustment, 
or “complexes.” He is said to project the inner aspects of his personality 
through his interpretations and creations, thereby involuntarily revealing 
traits that are below the surface and incapable of exposure by means of 
the questionnaire type of personality test. 

Personality inventories are standardized questionnaires that ask how 
the respondent feels or acts in a variety of representative situations. 
Projective tests, by contrast, are more or less unstructured;1 instructions are 
general and are kept at a minimum to permit variety and flexibility of 
responses; the responses, which are neither right nor wrong, are the sub- 
ject’s own spontaneous interpretations or creations. Projective tests are 
expected to elicit responses involving not only cognitive factors (that is, 
those that relate to what is present to the senses and to which meaning is 
given), but also affective factors (that is, feelings about what is there). 

The most widely used projective techniques (for example, the 
Rorschach inkblots, the Murray pictures) are, among other things, tests of 
perception and meaning, both of which are dependent upon individual 
mental processes.2 The less clear-cut the situation, the greater will be 
individual differences in perceiving it. These tests, therefore, provide 
relatively unrestricted opportunity for the exercise and expression of 
individual differences in perception; for each subject sees what he 
himself is disposed to see and does what he is personally disposed to do. 
In so doing, and through the manner of confronting and responding to the 
stimulus situation, the individual reveals some aspects of his personality.3 

By contrast with inventories that attempt to evaluate personality traits 
individually, the results of a projective test are used to interpret and 
understand a personality as a whole. Although specific meanings may be 
given to certain partial scores, as Rorschach himself did, the components 
of the whole test must also be interpreted in their interrelationships. This 
viewpoint is generally known as the holistic or organismic theory, accord- 
ing to which the whole and its parts are mutually interrelated, the whole 
being as essential to an understanding of the parts as the parts are to an 
understanding of the whole. According to the holistic principle, the 
measurement and evaluation of components alone does violence to the 
organized structure of the whole. 

The holistic conception of personality has emerged from clinical studies 
of numerous individuals whose behavior could be understood only in 
terms of the interrelationships and interdependence of traits, and from 


1 The term “unstructured,” as used in projective testing, means that the elements or 
attributes of the situation do not form a uniform and clearly defined pattern for all who 
encounter it. The term is synonymous with “ambiguous” in that the stimulus situation 
can elicit a variety of responses among persons tested, as well as a number of different 
responses from an individual. Projective tests, it will be seen, differ in the degree to 
which they are unstructured. 

2 Normal perception is defined as awareness of objects, conditions, and relationships as 
unified, articulated mental structures. Perception is also defined as a mental complex 
or integration that has sensory experiences as its core. Disturbances of perception will 
be shown by lack of integration, distortion, and bizarreness. 

3 This is not to say that each person's perceptions and responses are completely 
idiosyncratic. To each stimulus situation there are certain responses that are quite frequently 
obtained from certain groups. But there are also numerous individual variations and 
combinations which give each person's total response pattern its individuality. 
the many experimental investigations of perception and behavior. Projective 
methods, therefore, are regarded by many psychologists as the most 
valuable type of personality test because they are concerned with a com- 
plex of psychological aspects of the individual. 


The Rorschach Test* 


Description and Procedure. This is the well-known and widely 
used inkblot test, named after Hermann Rorschach, a Swiss, who began 
his experimentation with inkblots as a means of stimulating and testing 
imagination. He was not the first investigator to perceive the possibilities 
of inkblots in experimental psychology, although his work was the most 
extensive of any, having continued from 1911 to 1921. He is credited with 
being the first to develop a technique for their use in personality diag- 
nosis. He also changed the emphasis from content analysis to deter- 
minant analysis, which is explained in the following pages. Rorschach 
developed his test and methods as a practical tool to be applied to clinical 
cases in the study of unconscious factors in perception and meaning, and 
to reveal dynamic factors of behavior and personality. He proceeded on 
the principle that every performance of a person is an expression of his 
total personality, the more so if the performance is concerned with 
nonconventional stimulus situations in response to which one cannot wilfully 
conceal his individuality. In responding to inkblots, the subject is gen- 
erally unaware of what he reveals by the reports of what he sees. Yet in 
telling what he perceives, he provides insights into his personality. 

The Rorschach Test, used from the nursery-school level through 
adulthood, consists of ten cards, on each of which is one bisymmetrical 
inkblot. Five are in black and white with differently shaded areas. Two 
contain black, white, and color in varying amounts; three are in various 
colors (“chromatics”). The cards are presented to the subject one at a 
time and in prescribed sequence. The instructions are very simple; the 
subject is asked, according to Rorschach's own formula: "What does it look 
like? What could this be?" Several clinicians and investigators, who have 
used the test extensively, have somewhat modified the original 
instructions, though not in their essentials. Klopfer and Kelley, for 
example, use this formula: “People see all sorts of things in these 
inkblots; now tell me what you see, what it might be for you, what it 
makes you think of" (79). The principal difference in the directions of 
the several specialists who have evolved their own formulas is in the 
amount of encouragement and urging used to elicit from the subject the 
fullest possible response to each card. 

FIG. 25.1. An inkblot similar to the Rorschach blots. 

The purpose of this and the following chapter is to familiarize students with the 
essential characteristics of the instruments, their psychological rationale, their uses, 
and the major problems they present. Since the Rorschach and the TAT are prominent 
not only in psychology but also in anthropology, sociology, education and psychiatry, 
the student of the general field of tests and testing must have more than a cursory and 
fleeting glance at these instruments. The materials that follow are minimal for a 
reasonably adequate understanding of the characteristics and purposes of projective tests. 

* For a bibliography of early investigations, see J. E. Bell (12, pp. 75-76). See also 
Buros' several yearbooks on mental measurements, especially the fourth and fifth. The 
great bulk of research on the Rorschach has been published since 1935. For Rorschach's 
own thinking see (122). Rorschach's most literal interpreters are Oberholzer (123) and 
Beck (11). 

Rorschach did not impose time limits; nor do present users. Nor is 
there any fixed number of responses for each card. The examiner makes 
note of various aspects of the subject's behavior: namely, a verbatim 
record (so far as possible) of the responses; time elapsed between presenta- 
tion of each card and the first response to it (reaction time); length of 
time in long pauses between responses; total time required for each 
card (response time); position in which card is held for each response 
(indicating extent of the subject's exploration of the stimulus situation); 
the subject's extraneous movements and other behavior of significance. 
The three recordings of time are useful in determining emotional block- 
ing or resistance to what the individual might be perceiving in a particu- 
lar inkblot. 

Directly after all ten cards have been presented for responses, a second 
phase, the inquiry, follows.6 There are two main purposes of the inquiry. 
The first is to learn which aspects of the blot initiated and sustained the 
association process: response to wholes, parts, small details, location, 
color, shading, apparent movement, all of which are essential items of 
information for scoring purposes. Second, the inquiry [...] opportunity to 
add to, [...] done, it must be [...] without any suggestion [...] should 
be asked are those [...]. Too many or [...] which do not [...] rather, 
from suggestions [...] specific questions asked are to [...] 


Scoring. Following Rorschach's method for the most part, the 
scoring is based upon four major categories. 

Location. The first is the location, or the area, which has been per- 
ceived as the basis of each response. This may be the entire inkblot, a 
large portion, a small portion, a minute detail, or part of the white back- 
ground. The area may be well defined, or merely vague and blurred. Lo- 
cation of responses is the basis of obtaining scores for wholes (called W), 
large usual details (D), and small usual details (d), unusual detail (Dd), 


"Some examiners prefer to conduct the in 
of each card. Some users of the Rorschach have 


own in administering and sc 


quiry immediately after the presentation 


introduced occasional innovations of their 
oring, but these will not be described. 
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and the white spaces (S), which are parts of each person’s pattern of 
response to the entire test. Additional symbols are used to designate other 
aspects of location; but these five are the major categories. 

The locations of responses and the subject's ability to delineate them 
are regarded as indicative of his perceptual organizing processes, of his 
ability to analyze and articulate the parts, and of his associations as his 
perceptions shift within each blot. Analysis of responses in respect to lo- 
cation is said to reveal extent of the subject’s perceptual organization or 
disorganization, measured in terms of agreement with norms of percep- 
tion, and ability to analyze the whole and synthesize the parts. 

DETERMINANTS. The second category includes the determinants, or 
characteristics, of the inkblot as perceived by the subject. The determin- 
ants are those aspects or qualities of the blot that have produced the 
responses to it. These may be form, shading, color, perspective, or motion 
—or combinations of them. Forms may be perceived with ordinary ac- 
curacy (F); or they may be unusual and clear percepts (F+); or poor 
percepts (F—). Generally, evaluation of form is a matter of the ex- 
aminer's judgment, although some investigators have provided normative 
descriptions and numerical scores (78, 80, 116).

The frequency, intensity, and interpretation of shading noted by the 
subject are recorded. The manner in which the subject responds to the 
shading (K) of the blots is said to be relevant to the manner in which he 
meets and satisfies his own affectional needs; whether by conscious denial 
of affectional need, or by a repressive mechanism, or by insensitivity and 
undeveloped affectional relationships with other persons. 

With regard to color (C), the examiner records the particular colors 
reported and the manner in which the subject combines color with form, 
there being three categories: responses to pure color without form being 
involved (recorded as C); responses to a combination of form and color, 
in which form is dominant (FC); and responses to a combination of color 
and form in which color is dominant (CF). 

A score for movement is given by most examiners when the subject 
perceives something going on in the blot, whereas Rorschach himself re- 
stricted the movement score (M) to responses that indicated empathy; 
that is, a true experiencing of, or identification with, the movement re- 
ported (obviously an extremely difficult phenomenon for the examiner to 
discern). At present a common practice is to reserve the symbol M for
human movement, to designate animal movement as FM, and inanimate
movement as m.


The subject's mention of perspective or depth (FK) is also noted and
scored. Parts of the inkblots are perceived as having perspective and being
three-dimensional. FK, “in reasonable numbers,” is said to be related
to good adjustment, through attempts to handle affectional anxieties by
introspective efforts.

These are the principal determinants as developed by Rorschach, and as
further developed and modified by Klopfer and his collaborators.
Somewhat different sets of determinants have been developed
from Rorschach's original ones by several other psychologists, notably 
S. J. Beck (11). Although there are some differences in symbols used 
and in details of response classification, the conceptual similarities are 
much greater than the dissimilarities. One of Beck's categories, in par- 
ticular, should be mentioned, that is, what he calls organization (desig- 
nated by Z). This determinant, an extension of the concept of the whole 
(W), indicates ability to perceive (or create) "new and meaningful rela- 
tions between portions of the figure not usually so organized." Organiza- 
tion is said to be related to level of intelligence. A relatively low Z 
rating, it has been observed, m 


iety, or unresponsiveness in persons known to be of superior intelligence. 
CONTENT. The third scoring category is content.
merely classified into groups; they are
ascertaining the subject's personal
“complexes.” Some examiners have interpreted content items, also, as
having psychiatric or psychoanalytic meanings. For example, in some
contexts the response “eyes looking at me” (in some of the cards) is given
the obvious interpretation of “paranoid reaction.” “Puppets” or
“marionettes” perceived in a card are interpreted at times to suggest
schizoid tendency, as a feeling of being influenced and directed by
hostile persons.


ORIGINALS AND POPULARS. The fourth scoring category is originality,
also known as “popularity-originality.” This has to do with the rating
of a response as one that is commonly given (


popular and which original, althoug 
which there is no doubt. However, 
achieve satisfactory agreement in T 
sults, this problem of popularity: 
statistically, in a manner similar t 


* For example, Hertz (58), Thetford (146), Ames (4, 5, 6).

to the foregoing categories, or according to any one of the modifications
and elaborations thereof, is not an end in itself. The major purpose of the
test is to assess the subject's general adjustment, and to learn whether he
is experiencing psychological difficulties; in short, to get insights into
his personality, such as could not be obtained ordinarily by direct
questioning.

Although considerable experience under supervision is necessary to
learn the techniques of administering and scoring the Rorschach, much
more experience and expertness are required for the interpretation of
scores. With the Rorschach, the two aspects (expertness in a scoring
system and skill in psychological interpretation) are essential.

Interpretation. Once each response has been scored and the results
tabulated, the next and significant step is to analyze the relationships
existing between the frequencies in several categories. For example, the
following are among the relationships investigated: the proportion of
responses in each of the scoring categories; the ratio of wholes to larger
details and to minute details; extent to which form is used with other
determinants, such as color and shading, form being dominant; extent
to which color is used with other determinants, color being dominant;
relationship between color and movement responses; frequency with
which color is named apart from other determinants; the ratio of
human-movement responses to animal-movement responses; ratio of pure
movement responses apart from other determinants to movement plus other
elements, especially form; percent of original responses. These and other
comparisons and interrelationships indicate the organization and
patterned characteristics of the individual’s personality, as measured by
the Rorschach test. For instance, responses indicative of strong emotion,
and thus a sign of possible danger, in one record may be regarded as
relatively serious; in another record, if there are balancing factors,
they may be indicative of satisfactory adjustment.

The frequencies of responses in each of the categories and the various
ratios and interrelationships are the bases upon which the subject's
personality is delineated.

LOCATION. The locations of responses are used mainly in evaluating
intellectual aspects of personality: the manner of approach to a perceptual
problem; the preferred mode of apperception. A predominance or large
percentage of whole responses (W), that is, emphasis upon the whole
rather than the particular, is regarded as being characteristic of persons
of higher levels of capacity for intellectual organization and abstraction.
Not only the number of wholes is important here; the originality and
appropriateness of the responses must be considered; for simple, popular
wholes indicate superficiality and commonplace thinking.

Predominance of common responses of detail (D) is regarded as evidence
of concrete, unoriginal, practical mental processes. Responses of
unusual detail (Dd) indicate perception of the unusual, associated at
times with precise and critical mentalities. But carried to an extreme,
rare detail responses may indicate an obsessive preoccupation with the
trivial, often accompanying states of anxiety.

Rigidity of approach to problems is said to be indicated by a subject
who uses the same procedures with all cards in making his reports,
beginning with wholes, then proceeding to larger details, minute details,


FORM. If a person’s form perception is clear and accurate (F+ or

subject may entirely suppress his response; he may spontaneously respond
ociating it with a form or object; he may
one or the other being dominant. The degree
or principal determinant in a subject's re-
ndicative of emotional impulsiveness.
response, “color shock,” has been ob-

by peculiar responses; or by inability to respond to color. Color shock is
believed to imply anxiety neurosis; the person's ability to respond is 
seriously impaired by loss of emotional equilibrium because of affect 
produced by color. 

The concept of color shock and the actual appearance of the phe- 
nomenon, however, have been questioned and subjected to experimental 
study. Most of the experimental data do not support the original con- 
cept. But critics of these statistical studies maintain that this concept and 
the interpretation of color responses have been dealt with in a mechanical 
manner, without due regard to the individualized way in which color is 
handled in the record. They hold that how the subject responds to color, 
if at all, is more significant than statistical counts, in judging the impact 
of color stimuli upon emotional responsiveness. These critics of the usual 
statistical analyses point out that the following aspects of response must
be evaluated in clinical interpretations of color responses: color selection, 
color shyness, color denial, color avoidance, and disregard of color ac- 
companied by overt disturbance. 

The disagreement noted regarding this area of Rorschach interpreta- 
tion is an example, in general, of a principal basis of difficulty en- 
countered in efforts to objectify interpretations of responses and to 
validate the instrument as a whole; that is, the conflict between those who 
would break down the responses into elements for purposes of statistical 
analysis and those who maintain that by so doing the significance of each 
element is destroyed, since each must be viewed in its relationships to the 
whole. Those in the latter group maintain, also, that available statistical
methods are inadequate for handling this problem. 

SHADING. Responses that use shading are said to be related to the 
manner in which the subject meets his affectional needs. These responses 
are interpreted as being related to anxiety, depressed attitudes, and feel- 
ings of inadequacy. 

MOVEMENT. The movement score
cording to Rorschach, is evidence of t 
the higher the score, the richer the associations 


an movement, combined 


color responses, accompanied by few reports of hu 
cate the extroversive personality. 


Beck, one of the most productive students of the Rorschach, reports
that the significance of movement responses differs with various
personality organizations. In emotionally stable adults and in some
neurotics, it is as stated above. In cases of schizophrenia, movement
response is indicative of a highly subjective and personal experience. In
adults having adjustment problems but without psychosis, it represents
fantasy living; in the manic, it indicates egocentric wish fulfillment.

Klopfer, who has published extensively on the Rorschach, makes a
distinction between reports of human movement and those of animal
movement (79). In respect to the former, he agrees essentially with
Rorschach's interpretation. But a large proportion of animal movement,
according to Klopfer, indicates that the person is functioning on a “level
of instinctive prompting” rather than at the level of creative activity.

Other clinicians have added still another interpretation to movement
responses. They report that a high movement score, combined with
satisfactory form, originality, detail, and organization, indicates
superior intellectual ability.


CONTENT. The number, proportions, and kinds
in the responses have been var
significance (fantasy life, symbolic meanin
stereotyped thinking in the subject; as a sign of maturity or immaturity;
as revealing feelings of inferiority; as reflecting the subject's interests,
obsessions, and compulsions. Much of this content interpretation is
tentative and needs experimental confirmation.

ORIGINALS AND POPULARS.


evaluated; for these might be no more than bi 
perception. 


INTERRELATIONSHIPS.

should be apparent, too, that there is an appreciable degree of
subjectivity in both the scoring and interpretation of responses, and
necessarily so, in dealing with an unstructured test. This means, of course,
that skill in the use of such instruments and the attainment of maximum
validity can be achieved only after carefully supervised practice and
experience.


[Figure: profile chart of the number of responses (to be filled in by
examiner) in each determinant category: Movement (M, FM, m); Diffusion,
Vista; Form (F); Texture and Achromatic Color (Differentiated Shading);
Bright Color.]

FIG. 25.2. Rorschach response profile. From B. Klopfer and H. H. Davidson,
Individual Record Blank, New York: Harcourt, Brace & World, Inc., by
permission.

Figure 25.2 is a profile form devised by Bruno Klopfer and Helen H. 
Davidson. As reproduced here, it shows the number of responses of an 
actual case in each category, to illustrate the manner in which results are 
portrayed. Table 25.1, also by Klopfer and Davidson, is shown so that 
the student may see more clearly the extent to which the various scores 
are interrelated and comprehend more clearly what is meant by
interpretation of Rorschach responses as a whole. The information in this
figure and the table provides the starting materials for interpretation.

PERSONALITY STRUCTURE. The Rorschach test is a “multidimensional
instrument” that is intended to yield information on the structure of the
examinee's personality. Three major dimensions are evaluated: conscious
intellectual activity, externalized emotions, and internalized emotions.
Each of these is measured in terms of the several categories already
explained. The term “structure” designates the manner in which the various
personality traits or aspects are interrelated so as to produce each of
the three major “dimensions,” which, themselves, are interrelated in
producing the individual's whole personality. The organization, or
“structure,” of a person's traits is said to indicate how he experiences life
around him and how he utilizes some of his experiences. What are his
characteristic perceptions, attitudes, and behaviors that result from the
organization of his traits? For example, in regard to structure, some of
the questions that the Rorschach pattern of responses is intended to
answer are: Is the subject's mentality original or stereotyped? Are his
abilities creative or reproductive? Is he less adaptable or more adaptable
to reality? Is the “inner” or the “outward” life stronger? Or, more
specifically, how and to what degree does the individual control his
emotions and feelings? Is he prone to anxiety?

Taking one of the three major dimensions—intellectual activity—as 
an example, it would be appraised by the scores on the following aspects 
of the responses. Each aspect is accompanied by the factors regarded as 
significant elements in interpretation of mental level. 


Perception of form: quality and percentage of total responses; clearness 
vs. vagueness; accuracy vs. inaccuracy (F, F+, or F—). 

Perception of wholes: number and percentage of total responses; ability to
integrate; ability in abstract and theoretical activity.

Perception of major and minor details: number and percentage of total re- 
sponses; concrete or practical intelligence. 

Organization: ability to perceive and create new wholes; creative ability. 

Original and popular responses: qualitative richness of response vs. com- 
monplace, stereotyped thinking. 

Animal content: "sterility" or immaturity of thought processes (an excess 
of animal responses) vs. rich and insightful thought processes (perception 
of forms of a variety of categories). 


Productivity: total number of responses; intellectual energy. 

Sequence: rigid, or orderly and flexible, or loose, or confused approach to 
a problem, indicating level of intellectual control. 

Movement (mainly human movement): inventive and creative abilities vs. 
stereotyped thinking. 
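
Several of the aspects listed above (percentage of wholes, of details, of
form responses, and total productivity) rest on simple frequency counts.
The following Python sketch shows one way such a tabulation might be made.
It is an illustration only and is not the procedure of Klopfer, Beck, or
any other author cited; the record layout, the function name `tabulate`,
and the six scored responses are all invented for the example.

```python
from collections import Counter


def tabulate(responses):
    """Summarize a scored record as counts and percentages.

    `responses` is a list of (location, determinant) pairs such as
    ("W", "F"). This record layout is hypothetical, not a standard one.
    """
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to tabulate")
    locations = Counter(loc for loc, _ in responses)
    determinants = Counter(det for _, det in responses)
    return {
        "R": total,  # productivity: total number of responses
        "location_pct": {k: 100.0 * n / total for k, n in locations.items()},
        "determinant_pct": {k: 100.0 * n / total
                            for k, n in determinants.items()},
    }


# Invented record of six scored responses (not from any real protocol).
record = [("W", "F"), ("W", "M"), ("D", "F"),
          ("D", "FC"), ("D", "F"), ("Dd", "FM")]
summary = tabulate(record)
print(summary["R"], summary["location_pct"], summary["determinant_pct"])
```

Percentages of this kind are only the raw material; as the text stresses,
their meaning depends on the pattern of the record as a whole.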


Reliability. It is undoubtedly true that experienced, skilled 
clinicians are able to draw remarkably acute and sound inferences from 
Rorschach protocols; but this is a subjective matter. Ideally, it would be 
desirable to demonstrate objectively the inkblot test's reliability and 
validity. Considerable research has been devoted to that end. 

There are several problems and principles that should be kept in mind
in evaluating studies on the reliability and validity of the Rorschach test.
First, while certain part scores are significant, interpretation is
essentially global in character; but as yet there are no satisfactory
statistical methods to deal with global patterns of responses. Second,
rating or scoring of responses introduces a greater subjective element
(than in the case of intelligence tests) because there are no right or
wrong answers, the variety of responses is considerable, and normative data
are insufficient. As criteria of validity, differential clinical diagnoses
are not reliable enough; descriptions of behavior are nonquantitative and
subjective to an appreciable degree.

Studies on Rorschach reliability may be classified into the following six 
types. Each will be briefly explained and evaluated. 

1. Parallel series of cards
2. Split-half correlations
3. Test-retest correlations
4. Matching interpretations of Rorschach responses
5. Perceptual and conceptual consistency
6. Attempts to falsify responses and to misrepresent oneself

PARALLEL SETS. Two parallel sets of cards
tion that they are analogous
alternate series of inkblots,
the same types of responses as do the Rorschach cards,
to warrant the use of the
The Behn-Rorschach (80,
two groups after a period of 20-2) days. The a
subjects and one of 96 abnormal
scores of the two sets were, respectively, .41 and .52; but the range of
coefficients for the separate scoring categories was very wide: from about
zero to .86. These results are typi
Devising two equivalent or cl
perceptual and conceptual responses.

SPLIT-HALF. When this method is used to estimate reliability, scores
of responses to odd-numbered cards in each of the categories are
correlated with those given to even-numbered cards; for example, scores for
the number of wholes, good forms, large details, small details, and color.
This method has yielded coefficients ranging from approximately .60 to
.95 (26, 59).

For several reasons, the use of the split-half method has been seriously 
questioned; and some clinical psychologists would abandon it. First, the 
method is inconsistent with the widely held principle that the significance 
of a Rorschach protocol is to be derived from the whole integrated set 
of responses; therefore, correlating isolated variables is invalid. This argu- 
ment is of questionable merit, since isolation of a variable for statistical 
analysis is one matter, while isolation for interpretation is quite another. 
The second point is that there are differences in the stimulus value of 
each of the ten cards; hence, since they are not similar in characteristics 
or behaviors to be elicited, the odd-even method should not be em- 
ployed. A third point—one that applies to all quantitative analyses of 
the Rorschach—is that scoring of responses is too subjective, owing to 
lack of normative criteria and to the large variety of responses possible. 
A fourth objection is that the number of responses, in individual pro- 
tocols and in each category, is often too small to be statistically significant. 
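
The odd-even computation to which these objections are addressed amounts
to a product-moment correlation between two half-scores per subject. A
minimal Python sketch follows. The function names and the per-card counts
are invented for illustration, and the split (cards 1, 3, 5, 7, 9 against
cards 2, 4, 6, 8, 10) is assumed from the description above rather than
taken from any study cited.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 scores")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined when a half has no variance")
    return sxy / math.sqrt(sxx * syy)


def odd_even_halves(card_counts):
    """Sum one subject's counts over odd- and even-numbered cards."""
    odd = sum(card_counts[0::2])   # cards 1, 3, 5, 7, 9
    even = sum(card_counts[1::2])  # cards 2, 4, 6, 8, 10
    return odd, even


# Invented counts of whole (W) responses per card for five subjects.
subjects = [
    [1, 0, 1, 1, 2, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [2, 1, 1, 2, 1, 1, 2, 1, 1, 1],
    [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
]
halves = [odd_even_halves(s) for s in subjects]
r_split = pearson_r([o for o, _ in halves], [e for _, e in halves])
print(round(r_split, 2))
```

With so few responses per half, a coefficient of this kind is unstable,
which is the substance of the fourth objection noted above.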

TEST-RETEST. This method, using the same cards on two occasions, has
also yielded coefficients varying from low to high. One study with young
children reports reliabilities of determinants from .38 to .86, after a
one-month retesting interval (39). In another study after a twenty-one-day
interval, the average coefficient for the several scoring categories was .68
for a group of adults (30). The size of the coefficient tends to decrease
as the test-retest interval lengthens. Swift, for example, retested school
children at several intervals with the following results.


After fourteen days, coefficients varied from .59 to .83 for the several
scoring categories.

After a ten-month interval, coefficients varied from .18 to .53 for the
several scoring categories.


The findings for the fourteen-day interval are more significant as evi- 
dence of reliability (internal consistency) of a projective test, especially 
with young children who are developing rapidly and in whose lives ten 
months are a relatively long and significant period. 

Another reliability study, using twenty chronic i i i 
these reliability e rim 5 : PES p 
Location categories: .ķo to .95
Determinants: .16 to .96
Content: .36 to .94
Relative scores: -.17 to .84

On the whole, with a few exceptions, the coefficients ranged from
moderate to quite high. Low correlation coefficients, the authors state,
were due to small numbers of responses and narrow range of the number
within each category. An additional reliability test was applied: test and
retest tabulation sheets were matched by two experienced Rorschach
examiners. They were correct in 85 percent of the selections, this being
significant at the 1 percent level. In a study using one hundred college
students, fourteen different Rorschach indexes of intelligence, obtained
by testing and retesting, were correlated (3). The range of coefficients
was from .13 to .93, with a mean of .63.

A different approach, which may have promise, is the use of a scale
to rate responses (9). A group of students rated previously obtained
responses on a five-point scale according to how well each response
corresponded with what the rater himself was able to see in the blot. The
investigators’ hypothesis was that there will be a reliable positive
relationship between obtained responses and ratings. Split-half
reliabilities of the ratings ranged from .74 to .97, with a median of .86.
Retest reliabilities, after a two-week interval, of the percentage of
responses in each of
om .51 to .90, the median being .78. While this

MATCHING. As a method of estimating Rorschach reliability, matching
is regarded by many experts as most satisfactory, since the whole
Rorschach report is kept intact. This is a reliability technique in the
sense that it attempts to answer the question: “Do the responses have the
same meanings for different experts?” Internal consistency of responses,
then, is estimated in terms of consistency of meaning conveyed by the
responses; that is, reliability of scoring and of interpretations. Krugman
reports, for example, that three judges agreed perfectly in matching
twenty Rorschach records with interpretations that had been made by
others (81). Also, in this same study it was found that when two
independent interpretations of the twenty response records were made, the
two agreed essentially in respect to the significance of about 90 percent
of the response data, and showed partial agreement on 10 percent. In
another instance, when twenty-five records were matched with
interpretations, using six judges, the average coefficient of contingency
was .87.⁸

Other reports of matching are also favorable, but the number of 
matched protocols is small. When one evaluates findings in studies of 
matching, it is necessary to take into account the fact that there may be 
only partial agreement among the judges, yet other aspects of their in- 
dividual interpretations, which do not coincide with one another, may 
be valid too, because each, to some extent, has emphasized different 
aspects of the responses (76). When general evaluations of degree of ad- 
justment were made on a point scale, based upon the Rorschach records, 
the results were fairly good (36). Three qualified psychologists rated 146 
boys and girls on a four-point scale. The average correlation for their 
ratings was .67. 
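
The coefficient of contingency cited earlier in this section is presumably
Pearson's C, which is obtained from the chi-square of a cross-classification
table as C = sqrt(chi² / (chi² + N)). The Python sketch below computes it
under that standard definition. The function name and the table of counts
are invented for illustration and are not data from the studies cited.

```python
import math


def contingency_coefficient(table):
    """Pearson's C for a table of counts (a list of equal-length rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table contains no observations")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in empty rows or columns
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))


# Invented 3 x 3 table: judges' classifications (rows) against the
# classifications actually assigned (columns).
table = [[8, 1, 1],
         [1, 8, 1],
         [1, 1, 8]]
c = contingency_coefficient(table)
print(round(c, 2))
```

Because C cannot reach 1.00 in a small table (its ceiling for a table of k
rows and k columns is sqrt((k - 1) / k)), a value such as .87 has to be
read against the size of the table from which it was computed.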

Although matching is well regarded as a method, it has been infre- 
quently used, probably because it is most difficult to carry out with large 
enough numbers of cases rated and interpreted by a large enough num- 
ber of specialists. 
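
One way to judge whether a given number of correct matchings exceeds
chance is to ask how often a purely random pairing would do as well. The
Python sketch below assumes that each record is paired with exactly one
interpretation (a one-to-one matching); the studies cited may have used
other designs, and the function names are invented for illustration.

```python
from math import comb, factorial


def derangements(m):
    """Number of ways to pair m items so that none is matched correctly."""
    if m == 0:
        return 1
    if m == 1:
        return 0
    prev2, prev1 = 1, 0  # D(0), D(1)
    for i in range(2, m + 1):
        # Recurrence: D(i) = (i - 1) * (D(i - 1) + D(i - 2))
        prev2, prev1 = prev1, (i - 1) * (prev1 + prev2)
    return prev1


def p_at_least(n, k):
    """Chance probability of k or more correct matches among n pairs."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    favorable = sum(comb(n, j) * derangements(n - j) for j in range(k, n + 1))
    return favorable / factorial(n)


# Chance of matching all twenty of twenty records correctly at random.
print(p_at_least(20, 20))
```

On this assumption, chance alone yields on the average only one correct
match whatever the number of records, so that even a modest number of
correct matchings can be statistically significant.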

8 The coefficient of contingency is an index of correlation that is used
when both variables are classified in two or several categories, rather
than distributed quantitatively in the form of continuous or discrete
variables. See also Lisansky (91) and Palmer (108).

PERCEPTUAL AND CONCEPTUAL CONSISTENCY. A relatively recent and
promising approach to the study of reliability is one whereby responses
to the Rorschach blots are compared with responses to altered blots (28,
96) and with responses to other types of tasks which, like the Rorschach,
involve perceptual and conceptual tasks (9, 66). The findings thus far
indicate that an individual’s responses to the several types of stimulus
materials are rather consistent in regard to number of responses, variety
of interpretations, and use of color. More research of this kind is
necessary; but the findings reported provide evidence of reliability of the
perceptual and conceptual responses evoked by Rorschach inkblots.
ATTEMPTS TO MISLEAD. Attempts to falsify responses do not appear
to be an appropriate method of estimating Rorschach reliability. Some
psychologists have called this method “testing the limits of reliability.”
The results on almost any type of testing device can be distorted by
examinees, depending upon their intelligence and degree of psychological
sophistication. When this method is used, the experimenter is not actually
trying to answer the question of reliability. He is seeking an answer to


the question, “Is the Rorschach test resistant to a person's attempts to 
falsify respons 


sonality?” 
can be effected by deliberate effort; but the amounts and directions of 


as relative personality ma- 
; intellectual level, and amount of 
knowledge about and definiteness of “set” toward the test (39, 72). 

» other reports indicate that experimental subjects 
a generally good or poor Ini 
; in a large percentage of cases, in altering their 
persons, using the several scor- 
an appreciable extent 
and only two other subjects to a moderate 


; it is necessary to take into ac 
the instrument, the sub 


factors extrinsic to the te 
Caps, previous experience, attitude toward ex 


: n a test situation are not the product of only 
his “constant Personality traits reacting to constant test materials 

Available data do not unequivocally demonstrate reliability of the 
Rorschach
ployed largely to test and describe personalities that are maladjusted or
are in a fluid state. Probably the more significant result is the percentage 
of agreement of judges in the interpretation of responses. In regard to 
the Rorschach at present, it may be concluded that some researches in- 
dicate satisfactory reliability, as usually defined, while others do not; but 
there is a much higher and a reasonable degree of consistency among 
experienced Rorschach examiners regarding interpretation of response 
data. 

Validity. Since Rorschach's original work, several validating 
methods have been used in addition to his own. These may be classified 
as follows: 

1. Known groups
2. Rorschach diagnoses compared with diagnoses by psychotherapists
   and clinical interviewers
3. Rorschach findings compared with consistent observation of
   behavior over an adequate period of time
4. Matching Rorschach interpretations with clinical case reports
5. Comparing Rorschach protocols obtained before and after
   therapy, based upon changes in behavior
6. Single Rorschach variables, or a combination of a few, related
   to observed aspects of behavior
7. Experimental validation: (a) influencing the subjects; (b)
   varying the stimulus; (c) relation of Rorschach responses to
   physiological reactions

Each method will be briefly described and general results briefly noted.

KNOWN GROUPS. Rorschach himself used 288 mental patients who
presented clearly discernible extremes of certain traits. In addition, he 
tested more than one hundred artists, scholars, and persons of average 
abilities, and also some mental deficients. Among these groups, Ror- 
schach found what he regarded as significant differences in characteristic 
response patterns. For example, in the neurosis pattern, the following 
distinguishing signs, among others, are some of the most important re- 
ported: very few movement responses, color shock, shading shock, few or 
no form-color responses, noncombined form responses constituting 50
percent or more of the total number of responses, refusals to one or 
more cards, small total number of responses. This pattern was interpreted 
as signifying lack of social adaptability, excessively rigid control, sup- 
pression of spontaneity and originality, and anxiety. Since this original 
work, many similar studies have been published. The results of some of 
them are reasonably satisfactory and in essential agreement with Ror- 
schach; but many agree only in part. 
Comparisons wiTH CLINICAL DiacNosrs. Individuals are examined 
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and diagnosed with the Rorschach. Also, another staff member makes a 
psychiatric diagnosis after interview. Findings are then compared. ut 
one study of 26 children referred to a clinic, the two sets of diagnoses 
were in essential agreement in 62 percent of the cases before psycho- 
therapy. One year later, the diagnoses agreed in 89 percent of the cases 
(185). This is one of the more favorable reports. In another study, among 
the unfavorable reports, only four of thirty-four Rorschach scores 
differentiated significantly among the several groups of diagnosed patients 
(129). Among the large number of published studies, a wide range of 
findings is reported: favorable, indifferent, and unfavorable. 
COMPARISONS WITH OBSERVATIONS OF BEHAVIOR. Behavioral observations 
of selected individuals are made continuously over a period of time 
by qualified persons. The subjects observed are usually maladjusted chil- 
dren, adults diagnosed as pathological, or nonclinical school children. 
These observations are compared with Rorschach findings in respect to 
certain aspects of personality, such as intellectual functioning, anxiety, 
emotional expression, etc. The observations may be made in a camp 
for children, a community or recreation center, in a school, or the like. 
The results of these studies [...] personalities of thirty children [...] 
their Rorschach responses. The correspondence found was at the 1-percent 
level of significance. But here again, other reports of similar studies have 
not been nearly so favorable. 

MATCHING. A frequently used method of studying validity involves 
comparisons of Rorschach reports with clinical or other case reports. 
Some of the studies have used deviant personalities; others have used 
clinical groups; still others have equated two groups, except for one 
important factor, the significance of which was to be tested (for example, 
[...] compared with those in institutions); some [...] 
Rorschach reports were correctly matched with descriptions of pupils in 
private schools, written by their teachers. The marked difference be- 
tween the findings of these two studies can be accounted for, in con- 
siderable part, by the fact that Krugman's cases were a heterogeneous 
group whose case reports were prepared by clinical personnel, whereas 
the private school children were a relatively homogeneous selected group 
whose personality descriptions were probably on a behavioral level. In 
still another study, each therapist had to select the Rorschach report 
of each of his patients from among five others, so chosen as to be varied, 
but neither very similar nor very dissimilar. Correct selections were 
made in eleven of the twenty-eight cases. 

CHANGES AFTER TREATMENT. When Rorschach reports obtained be- 
fore and after treatment are compared, the hypothesis is that the test 
record should reflect personality changes that have taken place in the 
interim. Two studies, one involving psychoanalysis and the other 
involving insulin treatment, will be mentioned. In the first, there were 
thirty-six persons, for all of whom significant changes between the "be- 
fore" and the "after" Rorschach records were found; and these changes 
were related to the trends reported by the therapist: for example, im- 
proved emotional control, improved intellectual functioning (120, 154). 
On fourteen outpatient subjects, the Rorschach examiner and the thera- 
pist were in essential agreement regarding improvement. In the case of 
twenty-two hospitalized patients, the two clinicians agreed on ten in re- 
gard to direction and degree of improvement; for the remaining twelve, 
the therapist reported "social recovery" (a nebulous matter), whereas their 
Rorschach responses showed no improvement. 

When insulin treatment was used on schizophrenics, those who showed 
behavioral improvement also showed improved performance on the 
Rorschach: increase in speed of reaction and of response, improved verbal 
form and logical content, clearer perceptions and more relevant responses, 
and improved emotional control. Changes such as these are not precise 
and quantitative; but they may still be of clinical and behavioral 
significance. 

SEPARATE VARIABLES RELATED TO BEHAVIOR. Without discussion of the 
details and tentative findings in this area of validation, the kinds of studies 
undertaken and the general conclusions will be indicated. This is 
called the "molecular" approach because, instead of making a global 
interpretation, only one or a few Rorschach variables are studied in 
relation to selected aspects of behavior. The Rorschach variables most 
frequently investigated have been color, movement, and form responses 
as they are related to certain aspects of behavior, such as 
introversion-extroversion (as measured by a personality inventory), intellectual 
capacity (as measured by a test of intelligence). On the whole, the results 
have been inconclusive; statistically, they have not confirmed some of 
Rorschach's hypotheses regarding the differentiating significance of some 
determinants. 

EXPERIMENTAL VALIDATION. This method takes several forms. The 
subjects used may be influenced through experimentally induced tensions, 
hypnosis or drugs, brain surgery, or electric-shock treatment. When 
the Rorschach test is administered "before" and "after," it is possible 
to estimate the effects of the changed conditions. There are two principal 
obstacles to the employment of this method: first, the number of 
subjects is necessarily small in each study, because of the great amount 
of time required by each and the difficulty in obtaining subjects; and, 
second, the paucity of scientific information [...] 
subjects treated by means of surgery, drugs a[...] 

[...] with color and movement than with other variables. The purpose of the 
color experiments was to test the stimulus value of color in the cards and 
the hypothesis of “color shock.” Although the results have not been 
entirely consistent, most researches have rai[...] 
the earlier conceptions of the role of color. In response to the negative 
results, some clinicians point out that although the reported data are 
statistically relevant to the isolated stimulus of color, they are not necessarily 
relevant to the interpretation of color responses in the context of 
the whole. They emphasize, also, that the statistical findings refer to 
group trends but do not negate the fact that in some individual cases 
color has a disruptive influence. 

The third ex[...] relationships between Rorschach res[...]ivity. For example: 
What is the relationship between “color shock” responses and galvano-[...] 
[...] experimentally induced stress on [...]ental reports have 
demonstrated [...]rary emotional states, [...] 

SUMMARY. [...] 
are of three kinds, each differing only slightly from the other two: (1) 
intercomparisons of Rorschach responses of known groups; (2) "blind" 
diagnoses of individual cases, followed by comparisons with diagnoses 
otherwise determined; and (3) direct comparison, or matching, of an 
individual's Rorschach record with extensive clinical study and diagnosis, 
to determine areas of agreement and disagreement. If valid, the inkblot 
test can be used to facilitate a diagnosis in much less time, or to supple- 
ment or confirm a diagnosis arrived at by other means. Rorschach special- 
ists maintain that many investigations support the clinical value of their 
instrument. They have found that results obtained by the several meth- 
ods indicate the Rorschach test to be useful in revealing threatening or 
unwholesome trends in personality development before serious difficulties 
actually appear. If this predictive power of the instrument can be def- 
initely established, the Rorschach will become especially valuable in 
mental hygiene and preventive psychological treatment. For definite de- 
termination of its forecasting quality, an adequate number of longitu- 
dinal studies over an appreciable period of the subjects' lives will be 
necessary. Few of these are as yet available, since their achievement is 
beset by many difficulties (47, 48). 

Other Inkblot Tests. Several group techniques have been de- 
vised and are being applied for practical purposes while, at the same 
time, they are being subjected to experimentation and evaluation. 

In one instance, the Rorschach blots—on slides—are projected on a 
screen, each for a specified time, the subjects being required to write 
out their responses. They are later asked to mark the blot reproductions 
on their blanks and to answer a series of questions so that their responses 
may be scored according to the standard categories. 

A second method differs from the preceding in that the subjects are 
provided with a list of responses for each blot from which to choose 
(multiple choice). A suggested modification of the multiple-choice item 
proposes that the several responses (some “normal” and some "neurotic") 
be rated by the subjects themselves in their order of appropriateness or 
likeness to the particular blot to which they are attached. The sum of 
the values assigned to the "neurotic" responses constitutes the score for 
each blot. Another group procedure being explored is to have the subjects 
self-administer the test, following prepared instructions. 

Group methods have been used with varied degrees of success and in- 
consistent results. On the whole, they appear to be less successful than 
the individual method, which permits free, unlimited response and dur- 
ing which the examiner is able to make observations of the subject’s be- 
havior and attitudes. While group tests can be used appropriately and 
successfully in studies of group trends and group characteristics, the 
individual may be lost sight of in group findings; a person 
need not necessarily conform to the trend; and there is overlapping of 
scores and characteristics among groups. 

As is so often the case with projective tests, published research [...] 
group Rorschach tests are not consistently favorable or [...] 
reasons for the differences are not always evident, although poor [...] 
design and faulty statistical analyses are often said to be the principal 
ones. Further research is necessary. Such research may reveal not only 
defects in the group Rorschach but may show, for example, that successful 
persons in some occupations represent a wide variety of personality 
patterns, or that reliabilities of the criteria (for example, foreman's 
ratings, ratings by deans of students) are too low. In the meantime, it 
can be said that group Rorschach methods have been used effectively by 
many specialists. 

The Holtzman Inkblot Technique is an interesting and possibly prom- 
ising variation on the original Rorschach method. This technique con- 
sists of two alternate forms, each of which has forty-five cards. The note- 
worthy innovation is that the subject is instructed to give only one re- 
sponse to each card. The principal advantages are said to be (1) the 
total number of responses is controlled and relatively constant from 
person to person, yet the number is large enough to be significant; (2) 
intercomparisons of individuals are more meaningful because the num- 
ber of responses is relatively constant and large; and (3) alternate forms 
permit sounder estimates of reliability for each variable. 

Many of the reliability coefficients, for the separate variables, are high; 
others are moderate or low, as shown by the following data. 


Intrascorer reliability (72 university students)        .89-.97 
Interscorer reliability (72 university students)        .73-.89 
Interscorer reliability (40 schizophrenic patients)     .89-.99 
Internal consistency (odd-even cards)                   .00-.97 


Odd-even reliabilities were calculated for ten normal groups of subjects 
differing in age and occupational status (N = 60-197), and for five 
abnormal groups (mental-hospital patients and mentally retarded; N = 41-99). 
The coefficients vary markedly among these groups and among the 
variables for which the test is scored. An evaluation of this inkblot test, 
therefore, must be made for se[...] 


Regarding validity, [...] ". . . indicate quite c[...] systems have a great 
[...] of their respective v[...] criteria of validity, [...] 
[factors] of the Holtzman system have been amply documented, the 
external correlates are still relatively unknown" (p. 253). They add, 
however, that their data demonstrate highly significant group differences 
which are "generally consistent with earlier findings using the 
Rorschach"⁹ (p. 253). 

Evaluation of the Rorschach Test. The Rorschach inkblot 
method has been shown to have its greatest usefulness in revealing mark- 
edly deviant personalities. Its value in differentiating among individuals 
within the large groups in the middle ground between extremes is 
limited, partially for two reasons: (1) differences among individuals 
within the middle groups are not pronounced, hence they are more diffi- 
cult to measure or assess; and (2) the instrument is not sufficiently refined 
to detect finer differences. Other probable reasons are the lack of uni- 
formity in scoring, in determining interrelationships of scores of the 
several scoring categories, and in the psychological concepts on which 
various interpretations are based. It is pertinent to add, however, that 
often the Rorschach test and its specialists are asked by critics to reach 
higher levels of achievement in application and prediction than are imposed 
upon other types of psychological tests. 

Rorschach exponents recognize that the test needs further clinical and 
experimental research, especially normative studies for age levels, sex 
membership, cultural and economic status. More longitudinal studies 
are also needed. To some extent, progress has been made in these areas 
(4, 5, 6, 17). 

Adverse criticism of the Rorschach has been severe and, at times, hostile. 
Critics have dwelt on its inadequate objectivity, reliance on personal 
norms, limited validity, restriction to clinical use, and even "cultism." 
Although these criticisms are warranted to some degree, Rorschach exponents 
themselves have not been unaware of the problems and the unanswered 
questions; for the literature contains many of their own critical 
publications on the test's reliability, validity, clinical usefulness, 
guidance value, case studies, and predictive value.¹⁰ 

M. L. Hutt has stated the problem reasonably: "Clinical practice is, 
at present, an art. Art, I believe, will always remain an integral part of 
clinical practice. However, the scientific aspects of clinical work will, in 
time, become a major portion of this practice as theory, [...] 
criteria reach a more mature state of psychological development. Meanwhile, 
I frankly admit the importance of subjective norms, clinical judgment, 
and subtle influences of intuition. These . . . must be given full 
play in clinical work in reaching working hypotheses about the [person]. 
At the same time I rely as much as I can upon all the scientific [...] 
already have (norms, reliability and validity data, crucial empirical 
studies, and the like) and utilize these in refining my hunches and 
delimiting them. Finally, I test these clinical hunches against biographical 
data, clinical behavior, and all other evidence accumulated about the 
[person]. In short, I recognize that as a clinician I have two roles to play: 
the artist and the scientist. I use the former in getting to know the [person] 
and use the latter to correct my impressions as well as I can. . . . Each 
role has its place and . . . we must be careful not to confuse the two" 
(71; see also 43). 

⁹ The Howard Inkblot Test, consisting of twelve cards, is available. A considerable 
amount of careful work has been done in arriving at the final selection of cards and in 
analyzing the responses of the more than 500 persons, presented in the manual. This test 
is not intended for use as an alternate to or parallel with the Rorschach. The manual 
does not provide data on reliability or validity (internal or external), so that this test's 
practical value cannot be evaluated. 

¹⁰ See also Buros, Fifth Mental Measurement Yearbook, for four critical reviews of 
the Rorschach, varying in attitude and substance. 


Thematic Apperception Test 


Description and Procedure. Commonly referred to as the TAT, 
this projective method consists of thirty pictures plus one blank card. 
The cards are used in various combinations, depending upon sex and 
age. Some are used with all subjects, while others are used with only one 
sex group or age group. The maximum number of pictures used with any 
subject is twenty, usually administered in two sessions, ten each time. In 
actual clinical practice, however, examiners frequently use only ten cards, 
selected for the particular case. 

The person being examined is told that this is a test of imagination, 
that he is to make up stories to suit himself [...] 
only are a product of his inner personality traits but may be a superficial 
reflection of cultural forces (radio, television, movies, comics, current 
events, reading materials, etc.). For instance, a 10-year-old girl made up 
an unexpectedly large number of stories dealing with crime and mystery. 
She had been listening regularly to a radio mystery serial. The frequent 
or compulsive utilization of recent environmental experiences, however, 
is considered significant in interpreting a subject's reports, because the 
person has utilized them as representing a conflict on the preconscious 
level, or as a symbol on the unconscious level. Although the TAT 
pictures are not unstructured to the same degree as inkblots, they are suffi- 
ciently ambiguous to permit a wide latitude for individual differences 
in responses. 

There is a basic difference between the TAT and the Rorschach. The 
latter is intended to reveal the structure and organization of an individual's 
personality; the former is devised to bring out primarily the content 
of one's personality: the drives, needs, sentiments, conflicts, complexes, 
and fantasies. The test is based upon the principle that when a person 
interprets an ambiguous situation, he is apt to reveal aspects of his own 
personality which he otherwise will not admit, or of which he is not 
aware. The individual, being absorbed in the picture and attempting to 
construct an appropriate story, becomes much less aware or quite unaware 
of himself in the situation. In creating stories based upon ambiguous 
pictures, the individual organizes content of his own personal experiences. 
Everything he says is regarded as having meaning. 

FIG. 25.3. A picture from the Thematic Apperception 
Test. Harvard University Press, by permission. 

Analysis of Stories. Interpretation of TAT stories may be 
made in one of several ways, depending upon the viewpoint of the examiner 
and the purpose of the [...] aspects of different stories constitute a [...] 
make an analysis in accordance with the [...] in accordance with modifications 
suggested [...] analysis will be briefly outlined so that [...] stories be 
analyzed into (1) [...] (2) the forces emanating from [...] two divisions 
are analyzed under [...] 

1. The hero: the character in each picture wi[...]n whom the [...] to their 
principal traits (for example, solitariness, leadership, superiority, and 
criminality), [...] 

2. [...] heroes: analysis of everything each [...] noting especially the unusual, 
the high frequencies, the high and low intensities. Under this category, 
Murray [...]ve, on the basis of [...] 

3. [...] noting uniqueness, intensity, and frequency, as [...] not in the 
picture but invented by the subject. [...] effect, realized or 
promised, upon the hero. More than thirty such forces have been listed 
(for example, rejection, physical injury, dominance, lack, loss). The strengths 
of these are rated on a scale of one to five. 

4. Outcomes: the comparative strengths of the forces emanating from the 
hero and the strengths of those from the environment; the amount of hard- 
ship and frustration experienced; relative degrees of success and failure; 
happy and unhappy endings. 

5. Themas: interaction of a hero's needs with environmental forces, to- 
gether with the successful or unsuccessful outcome for the hero, constitute 
a simple thema. Combinations or sequences of these are called complex 
themas. The thema, simple or complex, is a synthesis of the elements an- 
alyzed under the first four categories, the purpose being to view the several 
forces in their interrelationships and to determine the subject's most preva- 
lent problems arising from internal needs and external forces. 

6. Interests and sentiments: choice of topics and manner of dealing with 
them, displayed by the positive and negative appeal of various elements in 
the pictures (for example, older women who may be mother figures, older 
men as father figures, same or opposite sex). 


S. S. Tomkins has devised another scheme of analysis (149). He differ- 
entiates it from Murray's in this way: "Its rationale consists in tapping 
varying levels of abstraction in the hope that significant aspects of diverse 
types of protocols will be detected by the use of concepts which range 
from a level of broad generality to a high degree of differentiation." Each 
story is scored under four main categories: vectors, levels, conditions, 
qualifiers. 

1. Vectors: the psychological direction of behavior, drives, and feelings. 
The ten vectors may have as their objects any thing, person, or idea of 
interest. Vector means a field of force, or magnitude and direction of force 
(for example, the vector "against," to attack objects; the vector "toward," 
to approach or enjoy objects). 

2. Levels: the "plane" of psychological function involved in the story, 
seventeen being listed (for example, object description, intention, wish, 
night dreams). 

3. Conditions: any psychological, social, or physical state that is not in 
itself behavior, striving, or wish; that is, conditional qualities of behavior. 
Two major divisions have been made: (a) states with negative factors or 
forces, called valences: lacks, loss, danger, inner conditions (depression, 
anxiety); (b) states with positive or neutral factors: abundance, security, 
moderation, inner conditions (optimism, certainty). 

4. Qualifiers: specific aspects of the first three categories: temporal 
characteristics (past, present, future, duration of an episode); contingency 
(degree of certainty); intensity (strength of items in story); negation (any type 
of denial); subsidiation (any means-end relationship); causality (any causal 
relationship). 

Any word or statement in the subject's stories may be classified under 
one or more of the major categories and their subdivisions. This method 
of analysis is exceedingly detailed and laborious; but its author states 
that it often yields insights not otherwise obtainable. 

Comparison of these two outlines and comparison of these with other 
schemes make it obvious that the TAT is not an objective test in the 
sense that tests of intelligence and specific aptitudes are; first, because 
the details of the stories are primarily evaluated and classified rather than 
scored; and second, because there are no uniform standards or criteria 
which can be objectively applied. We are not maintaining that an instrument 
like the TAT should or can conform to the standards and 
criteria of objectivity used in scoring and reporting other types of tests. 
But the fact that the TAT does not so conform explains why its interpreters 
report their results in different terms and why its use requires 
psychological insights on the part of interpreters, particularly into the 
psychology of needs and motivation. 

Regardless of the particular scheme used [...] 
results are interpreted as representing, [...] 
and traits of the subject's personality, belonging to his past or present, 
or projected into the future. The resu[...] 

Reliability. TAT reliability has been studied in three ways: 
1. Extent of agreement among interpreters of the same stories in regard 
to traits of the persons examined 

2. Similarities between stories on repeated examinations of the same 
persons 

3. Split-half method, correlating frequency and intensity of needs 
expressed in the stories. 

[...] rank-order correlation and the coefficient of contingency, have reported 
coefficients ranging from approximately +.30 to +.90. 

When the method used was percentages of agreement among inter- 
preters, clinicians have agreed on interpretations of from 50 to about 
75 percent of the stories. In addition, there was essential, though not 
detailed, agreement in from 10 to 25 percent. On the remaining stories 
there was only partial agreement (44, 46, 94). 

In a variation of this method (21), the same judges scored and, after a 
period of six months, rescored the stories of a large group of college stu- 
dents on ten TAT variables: for example, achievement, aggression, au- 
tonomy. The average correlation between the two sets of scores was .89. 

Interjudge reliability (contingency coefficients) reported by Henry and 
Farley varied from .38 to .46; yet, in a pilot study, the coefficients were 
from .87 to .94. These appreciable differences might be explained by 
differences between the sets of judges, differences between the groups of 
subjects, and variable conditions under which the test was given. The 
authors conclude, on the basis of their detailed analyses, that they 
". . . are only justified in declaring . . . that the personality revealed by 
the TAT expresses the constant needs, dispositions, and attitudes of the 
subject. About the consistency of test findings from administration to 
administration . . . and about the consistency of interpretation . . . we 
can make no unequivocal statement" (56, p. 21). The Q-sort technique 
was used, in another study (36), to determine the correlations between 
ratings given by different interpreters to characterizations of the “hero” 
in TAT stories. The correlations varied from .37 to .88 for various protocols, 
with an average of .74.¹² A. R. Jensen reports that in fifteen statistically 
sound studies of interscorer reliability, the coefficients range from 
.54 to .91, with an average of .71 (16, p. 164). 

These data reflect, in part, differences in systems of analysis and differences 
in ability and experience of interpreters. The results of researches 
are influenced also by the complexity and often intangibility of 
elements in the stories. Considering the nature of the problem, the re- 
liability coefficients obtained by this method are encouraging. 
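
The two indices of interpreter agreement named in this section, percentage of agreement and the coefficient of contingency, can be sketched in Python as follows. This is an illustrative sketch only. The function names and the sample ratings are hypothetical and are not taken from any of the studies cited. The contingency coefficient is computed in Pearson's form, C = sqrt(chi-square / (chi-square + N)), which is assumed here to be the index the investigators had in mind.

```python
import math
from collections import Counter


def percent_agreement(judge_a, judge_b):
    """Percentage of stories on which two judges gave the same classification."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return 100.0 * matches / len(judge_a)


def contingency_coefficient(judge_a, judge_b):
    """Pearson's C computed from the cross-tabulation of two judges' ratings."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(judge_a)
    cells = Counter(zip(judge_a, judge_b))  # observed joint frequencies
    rows = Counter(judge_a)                 # marginal totals, judge A
    cols = Counter(judge_b)                 # marginal totals, judge B
    chi2 = 0.0
    for r, r_total in rows.items():
        for c, c_total in cols.items():
            expected = r_total * c_total / n
            observed = cells.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (chi2 + n))


# Hypothetical classifications of eight stories by two interpreters.
judge_a = ["achievement", "aggression", "autonomy", "achievement",
           "aggression", "autonomy", "achievement", "aggression"]
judge_b = ["achievement", "aggression", "autonomy", "aggression",
           "aggression", "autonomy", "achievement", "autonomy"]

print(percent_agreement(judge_a, judge_b))
print(contingency_coefficient(judge_a, judge_b))
```

Note that the upper limit of C depends upon the number of categories (for a two-by-two table it cannot exceed about .71), so coefficients from studies using different category systems are not directly comparable.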

¹² See also Little and Shneidman (92). 

TEST-RETEST. Reliability data obtained by the test-retest method are 
affected by the stability of the personalities being examined and by 
personality changes as a function of time. The greater the time interval, the 
lower reliability we may expect, because there will be more opportunity 
for the influence of intervening forces. Since it is expected that some 
personalities often will change with time, as developmental conditions 
change, especially in the case of children, adolescents, and clinical 
subjects, the more significant reliability data are those based upon extent 
of agreement of judges and those of test-retest after a brief interval. 
Tomkins reports a reliability coefficient of +.80 after an interval of two 
months, for fifteen young women; +.60 after an interval of six months, 
using a different group of fifteen comparable subjects; and +.50 after 
ten months for a third group (149). While these last two coefficients in- 
dicate significant changes in ratings of some subjects, the indexes repre- 
sent group trends; they do not signify that all subjects showed important 
changes. It was found that the time intervals between tests had little or 
no effect upon the ratings of stable personalities. 
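
A test-retest coefficient of the kind reported above is ordinarily a product-moment correlation between the ratings obtained on the two occasions. The following Python sketch shows the computation under that assumption. The function name and the ratings are hypothetical, and the source does not state which correlation formula Tomkins used.

```python
import math


def pearson_r(first, second):
    """Product-moment correlation between two sets of paired ratings."""
    n = len(first)
    if n != len(second) or n < 2:
        raise ValueError("need at least two paired ratings")
    mean_1 = sum(first) / n
    mean_2 = sum(second) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    ss_1 = sum((a - mean_1) ** 2 for a in first)
    ss_2 = sum((b - mean_2) ** 2 for b in second)
    if ss_1 == 0 or ss_2 == 0:
        raise ValueError("ratings on each occasion must vary")
    return cross / math.sqrt(ss_1 * ss_2)


# Hypothetical ratings of one need for six subjects, tested twice.
first_testing = [3, 5, 2, 4, 1, 4]
second_testing = [3, 4, 2, 5, 2, 4]

print(round(pearson_r(first_testing, second_testing), 2))
```

Because the coefficient summarizes the whole group, a moderate value may arise from large changes in a few subjects while the ratings of the others remain nearly constant.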

In another study (90), in which only four cards were used, the subjects 
were retested after two months. The range of reliability coefficients for 
fifteen variables was .54 to .91, while the average was [...] 

SPLIT-HALF METHOD. Using the split-half method, Sanford reported 
reliability coefficients of .48 and .46. The responses were quantified by 
analyzing the stories for frequencies and rating intensities of "needs" and 
"press" elicited by the pictures (196[...] 
ordinarily too low to be significant. Child et al. (21) report 
internal-consistency reliabilities ranging [...] with a mean of .13 
for ten of the variables. McClelland (95), on the other hand, reports a 
[...] need, derived from [...] 

Validity. Essentially, several methods of studying TAT validity 
have been employed: 

1. Comparison with past histories and/or with results obtained through 
an intensive case study employing a variety of techniques 

2. Comparison of characteristics of known individuals or groups with 
their TAT records 

3. Comparison of TAT findings with other clinical materials: the 
subject's Rorschach record, dreams, or psychoanalytic interpretations 

4. Experimentally produced changes 

Since most of the studies do not fall exclusively in one category or an- 
other, they are not being presented according to this classification. 

MATCHING. Of these, the first method has been explored more than 
the others. Harrison's studies are regarded as among the most significant 
(44). Forty patients at a mental hospital were given the TAT, without 
prior knowledge, on the part of the examiner, of their histories. On the 
basis of their stories and behavior during the testing, Harrison drew his 
inferences concerning the personality development, traits, attitudes, level 
of intelligence, personal problems, and conflicts of each subject. These 
inferences were checked by another person against the hospital records. 
A correlation of +.78 was obtained between estimated and obtained 
IQs; 82 percent of the inferences were correct; in somewhat over 75 
percent of the cases, diagnostic classification was correct, using the major 
categories; when eighteen cases were classified into clinical subgroups, 
the percentage of agreement with clinical classification was 67. In order 
to eliminate inferences drawn from observation of the subject’s behavior 
during the testing session, Harrison had another examiner give the TAT 
to fifteen patients; then he made a blind analysis himself. In this instance, 
his inferences were 74 percent correct, when compared with already 
known biographical and personality data. This drop in correspondence 
(from 82 to 74 percent) indicates the value of using behavioral observa- 
tions in conjunction with projective-test results. 

In another matching experiment, an adaptation of the TAT was ad- 
ministered to groups of Navaho and Hopi Indians. On the basis of their 
stories, blind interpretations were made of the personality traits of the 
people of these two cultures. Anthropologists familiar with the two In- 
dian societies found the personality analyses based upon TAT results to 
be in essential agreement with their facts concerning these Indian 
cultures. 

An investigation in India, using a modified and adapted form of the 
TAT, yielded promising results regarding the test's applicability to the 
study of social problems in that culture. The findings make a positive 
contribution to the question of validity; for a high percentage of the 
stories dealt with the basic problems of survival (Murray's "succorance"), 
intrafamily relationships (between the wife and the family of the husband 
in particular; Murray's "submission"), and the need for education and 
skills to improve one's lot (Murray's "achievement and acquisition"). 

KNOWN GROUPS. Satisfactory results were obtained when the stories of
diagnosed groups, of known characteristics, were analyzed in detail to
determine if significant differences existed among them. The results
showed that such differences exist among the following classifications,
consisting of individuals who were relatively clear cases in each instance: 
conversion hysteria, anxiety hysteria, obsessive-compulsive neurosis, brain 
disease, and head-injury cases (27, 118, 131). 

Rapaport, Gill, and Schafer made a qualitative analysis of TAT
responses of clinically diagnosed individuals. They found trends in
responses that are significant for diagnostic purposes with groups such as
the depressive, the paranoid, the schizophrenic (116).

TAT responses were found to have considerable validity in the per- 
sonality studies of adolescent delinquents in a juvenile court (51). The 
TAT has been used, also, and with moderately favorable results, for the 
study of known nonclinical groups. Among these are the selection of 
potential leaders from among officer candidates in the armed forces (104)
and the differentiating TAT responses of prejudiced and nonprejudiced
persons (35).


COMPARISONS WITH AUTOBIOGRAPHIES. Data obtained from autobiographies
have been compared with TAT interpretations. Findings demonstrate that
some of the pictures (about 30 percent) elicited stories that reflected
past history better than did others and that, in this respect, the


acters with whom the 
3). Critics of this method believe that informa- 


subjects thus studied, the few reports available state that the
similarity was great enough to give added evidence of TAT validity (127).
In this connection, the subjective character of dream interpretation and
symbolism must be kept in mind.


AGREEMENT WITH RORSCHACH FINDINGS. Some investigators have reported
sufficiently close agreement between the Rorschach and the TAT as
additional evidence of the validity of the latter (45, 56). This method
must be evaluated in the light of the discussion of Rorschach reliability
and validity.

4 These unpublished studies were made by one of the author's former
graduate students, a native and life-long resident of India.

INTENSIVE STUDY OF INDIVIDUAL CASES. Morgan and Murray report
that the TAT stories of one patient indicated all the major
characteristics revealed by five months of psychoanalysis (98). This
successful outcome is attributable in part to the fact that the
psychologists making
both the TAT and the therapeutic studies were psychoanalytically 
oriented. This fact does not necessarily minimize the findings; it empha- 
sizes the possibilities of the TAT when uniform analytical concepts are 
used with the test and with other forms of behavior. 

Tomkins made an intensive study of one person, consisting of seventy- 
five hours of psychological interview and testing. He concludes that the 
results of the intensive study disclosed no material inconsistent with his 
TAT stories and analysis (148). On the whole, it was found that the TAT 
and other methods supplement one another, that each contributes some- 
thing to an understanding of the personality not revealed by the others. 
The usual criticism of studies such as this is that the clinician's inter- 
pretations of the stories are “contaminated” by his knowledge about the 
individual obtained through interviews. 

EXPERIMENTAL CHANGES. The use of experimentally produced changes
as a technique is based upon the view that validity is shown to the de- 
gree that induced changes correspond to changes in TAT responses. In 
one experiment the need of "achievement" was selected for study with 
college students. Four pictures were used, administered under experi- 
mentally controlled conditions (95). Different subjects performed under 
conditions called "relaxed," "neutral," "failure," and "success-failure."
The "relaxed" state was created by telling the subjects that the test is 
merely experimental; the “neutral,” by urging them to do their best, 
though the test is experimental; the "failure," by creating a sense of 
failure on previous paper-and-pencil tests; the "success-failure," by creat- 
ing a sense of success and of failure on previous pencil-and-paper tests. 

The relaxed state was interpreted as being least motivating and the 
failure state most motivating, in regard to the need for achievement. 
When this hypothesis was tested relevant to achievement, significant dif- 
ferences under the four experimental conditions were found (at the 
5-percent level or better) for many of the categories: for example, in- 
crease in achievement reports, deprivation related to achievement, acting 
to achieve a goal, projecting a goal. 

It appears that temporarily induced ego-involving tasks can influence 
TAT records. There is a basic distinction, however, between experi- 
mentally produced, transitory conditions and their effects upon responses, 
on the one hand, and actual, more or less durable personality
characteristics and content, on the other. The TAT is intended to assess
the latter. These experiments, therefore, have limited significance as a
method of estimating this instrument's validity. But they can indicate the
test's degree of sensitivity to temporary and artificial conditions under
which a person performs.14


While it is used principally in studies and diagnoses of maladjusted
and abnormal persons, the TAT is also used with others: for example,
selected normal groups, such as college students; groups having particular
attitudes, such as racial, economic, or religious prejudices; cultures other
than our own. In these areas, the TAT has contributed to a fuller under-


principal value has been to pro-
r sources of information.


The range of material in the test— 
fields") in which the TAT is more or less effective. If such research proves
fruitful, we should then be able to indicate validity not only in general 
but in particular, with respect to certain kinds of situations. 

It will be necessary, furthermore, to analyze TAT reports—in fact, all 
projective test responses—in the light of significant determinants of be- 
havior and personality development of the subjects. Such determinants 
are age, sex, special occupational training; and social, economic, and cul- 
tural status and values (broadly, one's caste and class status). Such analy- 
ses might reveal that scoring and interpretations should be modified on 
the basis of these determinants. It appears, for instance, that separate 
thematic apperception tests are desirable for children and adolescents. 
These are described in the next chapter.15

PROJECTIVE METHODS: VARIOUS


It was inevitable that a variety of projective devices should appear
after the Rorschach and the TAT had found wide favor and use
; some of these will be described and
prehensive view and appreciation of
all but one (word association) appeared subsequent to the Rorschach and
the TAT. In addition, several other projective methods are described,
although they are not tests in the usual sense. However, they have been
employed for many years (play, storytelling, and finger painting, for
example) and have proved to be clinically useful.


Word-Association Tests 1 


in psychological laboratories, With
after 1900, the word-association me
clinical technique. Jung and others, beginning about 1906, made extensive
use of the technique as a quick means of detecting "complexes" (46).

1 A general evaluation of the tests discussed is at the end of the chapter.

Jung's list of one hundred words was selected to represent common
emotional "complexes." The subject is told that the examiner will speak
a series of words, one at a time; after each word, the subject is to reply
as quickly as possible with the first word that comes to mind; there is no
right or wrong response. The examiner records the reply to each stimulus
word, the reaction time, and any unusual speech or behavior
manifestations accompanying a given response. Replies to stimulus words
that are emotionally toned for the subject generally have a longer
reaction time and may evoke physiological changes (in respiration, pulse,
flushing, blood pressure), restless movements, coughing, laughing, and
mild speech impediments. Jung believed that when a stimulus word was
relevant to an emotional disturbance of the subject, an unusual response
would be given. Content of the responses, reaction times, and attendant
conditions were analyzed for the discovery of emotional tensions, inferred
from the classes of words to which noteworthy responses were given. The
inferences then are used to initiate further psychological exploration by
interview.
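
The examiner's record of stimulus word, reply, and reaction time lends
itself to a simple tabulation. The sketch below is illustrative only: the
record is invented, and the rule for a "long" reaction (more than 1.5 times
the subject's own median time) is an assumption made for the example, not a
criterion given by Jung.

```python
from statistics import median

# Hypothetical record: (stimulus word, reply, reaction time in seconds).
record = [
    ("head", "hair", 1.2),
    ("green", "grass", 1.0),
    ("water", "drink", 1.1),
    ("death", "black", 3.4),
    ("ship", "sea", 1.3),
]


def flag_long_reactions(record, factor=1.5):
    """Return stimulus words whose reaction time exceeds factor x the median.

    The factor is an illustrative assumption, not a published cutoff.
    """
    cutoff = factor * median(t for _, _, t in record)
    return [word for word, _, t in record if t > cutoff]


flagged = flag_long_reactions(record)
print(flagged)
```

Words flagged in this way would only mark where the interview might begin;
the content of the reply and the attendant behavior still have to be
weighed by the examiner.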

The best known of the word-association tests is the one devised by
Kent and Rosanoff to differentiate between the mentally ill and the 
normal (51). Unlike Jung, they used words which were not intended to 
indicate personal emotional problems but were neutral in character and 
were to provide diagnostic evidence on the basis of the proportion of 
common (normal) responses to the uncommon (abnormal). Determina- 
tion of normal and abnormal replies was based upon the frequencies of 
word associations of 1000 normal and about 250 psychotic subjects. The 
table of response frequencies provided the percentages of most common, 
less common, and uncommon replies. The percentage in each category, 
it was expected, might differentiate the normal from the abnormal. It was 
found, however, that word associations did not distinguish clearly 
enough between the two groups, although the results were at times useful 
as additional evidence in the study of a particular person. Part of the 
Kent-Rosanoff word list follows: 


table black sweet soldier justice 
dark mutton whistle cabbage boy 
music comfort woman hard light 
sickness hand cold eagle health 
man short slow stomach Bible 
deep fruit high stem memory 
soft butterfly working lamp sheep 
eating smooth sour dream bath 
mountain command earth yellow cottage 
house chair trouble bread swift 
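
The Kent-Rosanoff scoring idea, the percentage of a subject's replies that
fall in each frequency category, can be sketched as follows. Everything
specific here is hypothetical: the frequency counts are invented, and the
cutoffs separating "common," "less common," and "uncommon" replies are
assumptions chosen for the example rather than values from the published
tables.

```python
# Hypothetical frequency table: how many of 1000 normal subjects gave each
# reply to a stimulus word. The counts are invented for illustration.
FREQUENCIES = {
    "table": {"chair": 300, "wood": 40},
    "dark": {"light": 400, "night": 50},
    "sweet": {"sour": 350, "sugar": 90},
    "music": {"song": 120, "piano": 60},
}

# Assumed cutoffs (not from the published test): a reply given by at least
# 100 of the 1000 subjects is "common"; by 10 to 99, "less common".
COMMON_AT = 100
LESS_COMMON_AT = 10


def classify(stimulus, reply):
    """Place one reply in a frequency category using the table above."""
    count = FREQUENCIES.get(stimulus, {}).get(reply, 0)
    if count >= COMMON_AT:
        return "common"
    if count >= LESS_COMMON_AT:
        return "less common"
    return "uncommon"


def category_percentages(protocol):
    """Return the percentage of a subject's replies in each category."""
    totals = {"common": 0, "less common": 0, "uncommon": 0}
    for stimulus, reply in protocol.items():
        totals[classify(stimulus, reply)] += 1
    n = len(protocol)
    return {name: 100 * k / n for name, k in totals.items()}


# One hypothetical subject's replies to four stimulus words.
protocol = {"table": "chair", "dark": "night", "sweet": "sour", "music": "zebra"}
percentages = category_percentages(protocol)
print(percentages)
```

As the text notes, such percentages did not separate normal from abnormal
subjects clearly enough; the sketch shows only how the tabulation works.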


Several other word lists are available. A recent one, which indicates a 
revival of clinical interest in word association, was constructed by
Rapaport, Gill, and Schafer (73). It is intended for use as an aid in
diagnosis and in estimating degree of maladjustment and impairment of
thought organization. This list is heavily loaded with stimulus words of
psychoanalytic significance, especially in regard to psychosexual
matters. The words and interpretations of responses are based upon an
analysis of what its authors believe to be the psychological processes
involved in word association (construct validity). Their rationale
includes three aspects of associative responses: (1) memory, as influenced
by emotional determinants that affect the process of associative recall;
(2) concept formation relevant to the stimulus word; (3) anticipation, in
terms of the popular character of the conceptual responses, based upon the
subject's ability to adopt a "set" from the examiner's instructions and
from the character of the stimulus words themselves, and upon his ability
to produce an appropriate response.

A modification of the word association technique is the homographic 
free association test. A homograph is a word spelled exactly like
another, but with a different meaning and a different derivation. An
example is the word “base,” which means "foundation" and, also, “wicked.”
Thurstone's homographic test, for instance, uses a list of words, each of
which has two frequent associations: a social, interpersonal one and a
literal, physical one (103). The word "revolution" is an example. The 
subject is asked to respond to each word with a synonym or a short 
phrase. A person’s associative response to “revolution” can have social or 
physical relevance: namely, “a political upheaval” or “the turn of a 
wheel.” The purpose of such a restricted list of words would be to iden- 
tify individuals who are the more strongly oriented toward their social 
environment. This is shown by the number of responses that are essen- 
tially social in nature rather than physical or literal. Similar lists of 
homographs can be devised for other purposes as well. 
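
The count of social as against physical responses can be sketched in a few
lines. This is an illustration under stated assumptions, not Thurstone's
scoring procedure: the examiner is assumed to have already judged each
response as "social" or "physical," and every stimulus word other than
"revolution" is a hypothetical choice made for the example.

```python
# Hypothetical hand-coded protocol: stimulus word -> examiner's judgment of
# the subject's response. Only "revolution" comes from the text.
coded_responses = {
    "revolution": "social",    # e.g., "a political upheaval"
    "bond": "physical",
    "party": "social",
    "cell": "physical",
    "race": "social",
}


def social_proportion(coded):
    """Return the fraction of responses judged social in nature."""
    social = sum(1 for judgment in coded.values() if judgment == "social")
    return social / len(coded)


proportion = social_proportion(coded_responses)
print(proportion)
```

A higher proportion would be read, on the rationale given above, as a
stronger orientation toward the social environment.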

The many determinants that influence word associations must be taken 
into account in utilizing tables of res 
pretation of responses (72, 101 
plexes” and thought impair 
regional, cultural, and socio 
persons. These Negroes responded with stories that were chiefly simple,
matter-of-fact descriptions of the persons and objects in the pictures. 
They did not manifest empathy based on the interpersonal relations 
being portrayed. The value of the TAT depends upon the principle that 
extent and strength of identification derive from the number and kinds 
of symbolic elements in the picture, relevant or common to the person 
being examined. Hence, if a picture does not satisfy this principle, there 
will be little or no identification and the test will not yield the desired 
projective information. 

Most published research reports on this modification have not
confirmed Thompson's position that it has greater clinical value than the
original TAT (22, 55, 77). As yet, however, there is too little research
from which to conclude whether or not Thompson's modification is
superior to the original for use with Negroes. It does appear that the
Thompson version is useful in studying racial attitudes and notions of
stereotypes in both white and Negro subjects.

CHILDREN'S APPERCEPTION TEST. This test, for children of 3 to 10 
years, consists of ten pictures in which all characters are animals. These 
animals are shown in commonplace human situations: eating, sleeping, 
shopping, being punished, and so on. The assumption underlying the 
use of animal pictures is that children will more readily identify with 
them than with humans. In support of this, the authors cite Freud's 
widely known report on “The Phobia of a Five Year Old." 

The authors state in their manual: “The pictures were designed to 
elicit responses to feeding problems specifically, and oral problems gen- 
erally; to investigate problems of sibling rivalry; to illuminate the attitude 
toward parental figures and the way in which these figures are apper- 
ceived; to learn about the child's relationship to the parents as a 
couple . . ." (8). The pictures are intended to elicit, also, the child's 
fantasies regarding aggression and the adult world, and his methods of 
responding to and dealing with his problems of growth. 

The themes of the pictures are derived primarily from problems and 
relations suggested by psychoanalytic theory of development and be- 
havior. Interpretations of the stories rest upon the symbolic significance 
attributed to the content. Although psychoanalytic theory guided the 
authors of this test in its construction, it is possible to use other prin- 
ciples in the interpretation of responses. 

Ten supplementary pictures also have been published (7). These, the 
authors state, are designed to elicit themes “. . . not necessarily pertain- 
ing to universal problems, but which occur often enough . . . in a good 
many children." These pictures are intended to reveal the following, 


among others: fear of physical activity or of physical harm, interpersonal
problems in school, wishful fantasies about adulthood, oral tendencies,
competitiveness, ideas of body image, fantasies or fears related to the
mother's pregnancy.

Much research remains to be done on reliability and validity, but
clinicians report that the CAT is a helpful tool in revealing some of the
basic traits of a child, while the supplement can indicate transitory char- 
acteristics. These, then, can be further investigated by other procedures. 

Here, again, we have a useful instrument in the hands of skillful psy- 
chologists, but in regard to which several fundamental questions remain 
to be answered in addition to those of reliability and validity. What will 
adequate normative data show regarding types and frequencies of stories 
in relation to age, sex, and socioeconomic status (18)? Do children, in 
general, actually identify more readily with animals (10)? How are re- 
sponses influenced by intelligence level (47)? 2 

SYMONDS PICTURE-STORY TEST. This set of 20 pictures is designed for use
with adolescent boys and girls. The cards depict situations and
interpersonal relations in which individuals at this stage of development
commonly find themselves. The stories told in response to the pictures
are analyzed for the psychological forces indicated by them. The forces
are among those commonly described in dynamic psychology:


hostility and aggression          ambition and striving for success
love and erotism                  conflicts
ambivalence                       guilt
punishment                        guilt reduction
anxiety                           depression, discouragement, despair
defenses against anxiety          happiness
moral standards and conflicts     sublimation

The author’s norms indicate 
tained are concerned with fami 


up study gives support to this view (97):
After an interval of thirteen years (1940-1955), twenty-eight of the
original group of forty boys and girls (ranging in


2 Responding to pictures is a test item long used in the Binet scale and
its revisions. At the age of 3 years and 6 months, only "enumeration" is
normally expected. The successive higher levels of response are
"description" and "interpretation." If this is the course of development,
"identification" may not be expected at the lower levels. This view was
supported by the findings of Kaake (47).
estimate personality adjustment in later years from facts gathered about
a person when he is adolescent" (97, viii). As statistical evidence, Symonds 
and Jensen report a correlation of .54 between estimates of adjustment 
made when the subjects were adolescents and those made thirteen years 
later. They recognize, however, that some of their findings should be 
regarded as tentative or "fruitful hypotheses.” Still, the evaluations of 
personalities based upon test-retest results indicate that this set of pic- 
tures is able to provide significant information (97). 

THE BLACKY PICTURES (12). This set of twelve pictures, intended for
ages 5 and over, is offered as a "measure of psychosexual development."
They are used to learn the degree to which the subject has developed the
various psychoanalytic "dimensions" of psychosexual traits. The pictures
portray a young dog, Blacky, in situations intended to represent relevant 
experiences with three other dogs: his father, mother, and sibling. As in 
the case of the Children's Apperception Test, the Blacky pictures an- 
thropomorphize by presenting the animals in obviously characteristic 
human situations. The same questions that were raised regarding the 
CAT may be asked of the latter. The usual methods of estimating re- 
liability of projective tests were employed: matching, interrater agree- 
ment, and correlating rating and rerating of the "dimensions" of stories. 
On the whole, the results showed that the test yields results which are 
stable to a “modest degree" (38). 

Validation studies thus far published suggest "that there is something 
there . . . but do not necessarily indicate what it is or where it is." 
Validity is reported in several ways (70): 

Relationships within the theory itself; that is, construct validity:
Statistically significant correlations were found among “dimensions”;
hence, it was concluded, they are related in some systematic manner.

Responses analyzed for sex differences: Results were in the direction that
would be predicted by psychoanalytic theory.

Understanding clinical syndromes: Results were interpreted as a "valid
indicator" of psychosexual deviation in a small, selected (clinical)
population.

The factor analytic approach: Male subjects produced results consistent
with psychoanalytic theory, but the responses of females presented
contradictions that could not be explained.


The most prevalent current view is that the Blacky pictures are useful 
in studies of psychoanalytical theory; their use in clinics and in theoretical 
analyses of personality traits requires sound preparation in psychoanalytic 
theory; in individual cases, where relevant, they are useful to indicate 
aspects of psychosexual development to be investigated. 


3 Also see symposium on validity (98).
MAKE-A-PICTURE STORY (88). This device, for use with adolescents and
adults, combines one feature of the TAT (telling stories) with an in- 
novation: giving the subject an opportunity to construct his own pictured 
situation with the materials provided by the test. These materials consist 
of twenty-two "background pictures" (achromatic, on cardboard); some 
ambiguous, others semistructured, still others definitely structured. 
Among the backgrounds are, for example: 


living room street forest 
bedroom camp cave 
bathroom landscape schoolroom 


There are also sixty-seven cutout figures (sixty-five human and two ani- 
mal) representing: 


male adult indeterminate as to sex children 
female adult legendary and fictitious a dog 
minority groups silhouette and blank faces a snake


These figures are portrayed in a variety of postures and states. 

The examiner selects one background picture at a time, asks the subject 
to place one or more figures of his own choosing against the background, 
as they might appear in real life, and then to tell a story about the scene 
he has created. Any of the figures may be placed against any background. 
The principle here is that the individual may project whatever actions 
and relations he wishes upon whatever persons he chooses. 

After each story, the examiner conducts an inquiry, as in the case of 
the TAT. Complete responses are recorded and analyzed according to a 
rather detailed and elaborate scheme. Essentially, the stories are analyzed 
primarily for form and secondarily for content. The second of these— 
content—may be of almost unlimited variety, intended to elicit through 
these "fantasy productions" the subject's social adjustments and relations. 
By analysis of form, the test’s author means ". . . which figures are 
chosen, how many are chosen, where they are placed on the background, 
how they are handled by the subject, and what relationships they bear to 
each other." Form is analyzed and characterized in terms of forces operat- 
ing on the subject (goals, drives, conflicts, values, ego ideals) and of modes 
of behavior (hostility, sexuality, aspiration, autonomy). 

Some psychologists think that the stories evoked by this test may be
more spontaneous and richer in fantasy than the TAT because the
pictured situation has been created by the subject himself (27). On the
other hand, it is possible that the subject will evade certain types of
situations and problems, significant in his case, with which he would have
to deal in one way or another in the TAT type of test.




THE MICHIGAN PICTURE TEST (2). This instrument is the most
systematically and solidly constructed thematic apperception test that has
appeared since Murray's TAT. It is intended for use with children from
8 to 14 years of age. The test consists of sixteen pictures, twelve of which
are used with each child. The selection depends upon the subject's sex.
The principle is the same as that of the original TAT: revealing
needs through responses to pictures. Construction of the Michigan test
was motivated by reports from child guidance clinics that the TAT was
not entirely suitable for children under 14 and that a special test was
necessary.

The authors state in their manual that the over-all purpose of their
project was to “. . . investigate and measure the emotional reactions of
children in the preadolescent and adolescent stages of development. . . .
It was believed the test should be nontraumatic and yet tap the common 
conflict situations for this age group.” To do this, they followed the 
methods of population sampling, behavior sampling, and response analy- 
sis prescribed for and frequently found in the construction of tests of in- 
telligence and specific aptitudes. 

Results of the construction process indicated that certain test variables 
effectively discriminated between groups of well-adjusted and groups of 
poorly adjusted children. These variables constitute: (1) the “tension in- 
dex,” (2) “verb tense,” (3) “direction of forces.” The tension index is based 
upon the following four types of needs (2, p. 66): 


love: verbal expression, positive or negative, indicating affection,
affiliation, attachment, friendship, or admiration

extrapunitiveness: verbal expression of aggression toward an external
object

submission: verbal expression of defeat, resignation, passivity,
compliance, obedience, acceptance of suffering without opposition

personal adequacy: verbal reference, positive or negative, of happiness,
strength, competence, or any reference to the temperament or physical
characteristics of the human or animal figures in the story


"Verb tense" of responses is scored for frequencies of past, present,
and future tenses. The hypothesis, empirical in origin, is that
disproportionate emphasis upon each of the several tenses may indicate
behavior as follows:


past tense:     avoidance of conflict
                a regressive trend
                schizoid character structure
                submissiveness or isolation

present tense:  compulsivity or pedantry
                personality disturbance
                effective intellectual functioning

future tense:   anxiety
                disturbed but relatively mature personality
                inefficient intellectual functioning
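
The frequency count behind the verb-tense variable can be sketched as
follows. The sketch assumes the scorer has already labeled each verb in a
story as past, present, or future; the sample labels are invented, and the
threshold for calling an emphasis "disproportionate" is an assumption made
for the example, since the text gives no numerical criterion.

```python
from collections import Counter

TENSES = ("past", "present", "future")


def tense_proportions(labels):
    """Return the share of verbs in each tense from hand-coded labels."""
    counts = Counter(labels)
    total = len(labels)
    return {tense: counts[tense] / total for tense in TENSES}


def dominant_tense(proportions, threshold=0.6):
    """Return the tense at or above the threshold, or None if there is none.

    The threshold of 0.6 is an illustrative assumption only.
    """
    tense = max(proportions, key=proportions.get)
    return tense if proportions[tense] >= threshold else None


# Hypothetical coding of the eight verbs in one child's story.
labels = ["past", "past", "present", "past", "future", "past", "present", "past"]
proportions = tense_proportions(labels)
print(proportions, dominant_tense(proportions))
```

Any interpretation of a dominant tense would still follow the hypotheses
listed above and would need the support of the other test variables.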


“Direction of forces” refers to the action expressed in the story: 


centrifugal direction (outward action) 
centripetal direction (inward action)
neutral (no direction indicated) 


Well-adjusted children, it was found, express both centrifugal and 
centripetal directions much more frequently than the poorly adjusted. 

A second group of variables, for which the stories are also scored,
consists of those “. . . for which trends were indicated, but in which
differences were not statistically significant.” This group of variables
includes the following:


psychosexual level: a measure of psychosexual maturity in orthodox
Freudian terms

interpersonal relationships: range and frequency of expressed
interpersonal relationships

personal pronouns: frequencies of the three personal pronouns as a measure
of self-reference

popular objects: most commonly referred to objects and persons

level of interpretation: degrees of interpretation, from "no response,"
through enumeration and description to complex interpretation and
inference


* Reliability correlations of the four variables constit: 
obtained by two judges, grade 3: .67, .93 93, 1.00; : 

l ges, * 77+ 93» 93, 1.00; grade 5: $i, 9». o7, o8: pra and 
9: .70, .81, .98, .98. Fifteen cases in each group. Reliability donde i inection of 


forces" as scored b i m 5 b. 
i cad inp, y two judges, grade 3: .95; grade 5: .87; grades 7 and g: .g1. Ten cases 


uting the “tension index” scores 
is "Picture-Association Study for Assessing Reactions to Frustration.”
Consisting of twenty-four cartoon-like pictures, the test is intended to 
serve as a projective method for revealing the subject's characteristic 
patterns of response to common stress-producing situations regarded as 
important in normal and abnormal adjustment. Two forms are available, 
one for children of ages from 4 to 14 years, and one for persons older 
than 14. 

Each picture shows two persons involved in a frustrating situation of 
common occurrence. One person in each picture is represented as making 
a statement which either helps describe the frustration of the second 
individual or is itself actually frustrating to the latter. The caption box 
above the second person is blank. The features and expressions in all 
pictures are omitted. 

The task of the subject is to examine each picture and to write in the 
blank box the first appropriate response that occurs to him. (Young
children give oral responses to be written by the examiner.) The
assumption is that the subject identifies himself, consciously or
unconsciously, with the frustrated individual in each situation and that his
replies are projections of his own ways of acting. Obviously, this picture
test differs from others in being relatively structured and in serving a 
limited purpose. 

The situations presented are of two kinds: ego-blocking and superego- 
blocking. The first are those “in which some obstacle, personal or im- 
personal, interrupts, disappoints, deprives, or otherwise directly frustrates 
the subject." The second type of blocking "represents some accusation, 
charge, or incrimination of the subject by someone else." An inquiry 
follows the recording of responses, for clarification and elaboration. 

Scoring of responses is based upon (1) direction of aggression and (2) 
reaction type. Under the first of these, three forms of expression are dis- 
tinguished: (a) extrapunitiveness, in which aggression is turned onto the 
environment; (b) intropunitiveness, whereby the subject turns the aggres- 
sion upon himself; and (c) impunitiveness, in which aggression is evaded 
in an effort to gloss over the frustration. Each type of reaction also has 
three classes: (a) obstacle dominance, in which the barrier occasioning the 
frustration stands out in the response; (b) ego defense, in which the ego 
of the respondent predominates; and (c) need persistence, in which the 
subject emphasizes the solution of the frustrating problem. In addition to 
this analysis, the manual gives a sufficient number and variety of re- 
sponses and examples of scoring to provide this projective test with a 
relatively objective scoring system. 
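
The two-way classification just described, direction of aggression by
reaction type, amounts to a three-by-three tally. The sketch below assumes
each response has already been scored by hand on both dimensions; the
sample scores are invented, and the percentage summary is an illustrative
convenience rather than the procedure given in the manual.

```python
from collections import Counter

DIRECTIONS = ("extrapunitive", "intropunitive", "impunitive")
REACTION_TYPES = ("obstacle dominance", "ego defense", "need persistence")


def tabulate(scored):
    """Build a direction-by-type table of counts from scored responses."""
    counts = Counter(scored)
    return {d: {t: counts[(d, t)] for t in REACTION_TYPES} for d in DIRECTIONS}


def direction_percentages(scored):
    """Return the percentage of responses falling under each direction."""
    totals = Counter(direction for direction, _ in scored)
    return {d: 100 * totals[d] / len(scored) for d in DIRECTIONS}


# Hypothetical hand-scored responses to eight pictures:
# (direction of aggression, reaction type).
scored = [
    ("extrapunitive", "ego defense"),
    ("extrapunitive", "ego defense"),
    ("extrapunitive", "obstacle dominance"),
    ("extrapunitive", "need persistence"),
    ("intropunitive", "ego defense"),
    ("intropunitive", "need persistence"),
    ("impunitive", "obstacle dominance"),
    ("impunitive", "need persistence"),
]

table = tabulate(scored)
percentages = direction_percentages(scored)
print(table)
print(percentages)
```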

The responses to the frustration pictures are intended to show the
individual's frustration tolerance, which signifies the relative absence of
observable disorganization in response to frustrating situations; or adequacy
and efficiency of response despite frustrations. Frustrations are common 
experiences; modes of adjustment to them are significant in understand- 
ing behavior and personality organization, since these modes reveal one's 
ways of coping with tensions. The Rosenzweig pictures are intended to 
evaluate an individual's tendency to blame the source of his frustration 
(extrapunitive), or himself (intropunitive), or to treat the situation im- 
personally (impunitive). 

Two questions have been raised regarding the interpretation of
responses. First, are the responses given to the pictured incidents, which
represent rather mild frustrating situations, indicative of the subject's
typical responses to frustrating situations in actual life that are of basic
significance to his personality? Second, do the subject's responses indicate
what he actually would do, or what he thinks he should do, or feels he
would like to do, but would not actually do? Although these questions
have not been answered definitely, the results obtained with the test
indicate either: (1) how the subject probably would act in these kinds of
situations, or (2) what he knows or thinks would be expected of him in
such situations.


Although the Picture-Frustration . . . validity problems and deficiencies of . . .
which subsequent interviews can be . . .

THE SZONDI TEST. Devised by the psychiatrist Lipot Szondi, this test
consists of six sets of facial pictures . . . a homosexual, a sadistic
murderer, an epileptic, an hysteric, a catatonic, a paranoiac, a depressive,
and a manic. Deri (26) . . . subject's psychosexual development; selection
of sadistic murderers indicates aggressiveness; epileptics, a violent quality
of behavior and thinking; hysterics, the emotional aspect of life;
catatonics, the quality of narcissism and withdrawal; paranoiacs, expansive
forces and creative abilities; selection of manics and depressives, the
person's “mood life.”

The number and classification of choices (like or dislike) in each 
category, it is claimed, indicate strength of tendencies that are latent in
the person. If few or no choices are made in a category, the interpreta- 
tion is that the tendencies are already overt in the individual. 

According to the Szondi interpretation, therefore, every person has all 
the tendencies specified. If certain of the pictures are rejected altogether, 
the traits they are presumed to represent are said to be overt. Selected but 
disliked pictures mean, we are told, that the subject is repressing or sub- 
limating the tendencies which it is claimed they represent. Selected and 
liked pictures, we are told, represent tendencies that are acceptable to 
the subject's “superego” and available for expression. 

This test is based upon two unsound and generally rejected assumptions.
The first is that photographs can represent and differentiate "personality
types." The second premise is that a person's overt behavior
derives from his inherited genetic elements, that his selections are “in- 
stinctively" determined and represent latent forms of behavior which seek 
expression. This assumption is contrary to the current views of the roles 
of heredity and experience in personality development. It is also con- 
trary to modern principles that have radically modified the concept of 
instinctive behavior. Thus far, a fairly large number of studies on the 
Szondi have been reported. Their findings and conclusions are almost
uniformly negative (13, 21, 28, 37, 39).

Verbal Completion Tests 


SENTENCE-COMPLETION Tests. These are tests that present the indi- 
vidual with a series of incomplete sentences, generally open at the end, 
to be completed by him in one or more words. They resemble the word- 
association test in that the word or words used to complete the sentence 
follow from and are associated with the given part of the sentence. The 
sentence-completion test, however, is regarded as superior to word-associa- 
tion because the subject may respond with more than one word, a greater 
flexibility and variety of response is possible, and more areas of person- 
ality and experience may be tapped. 

The content of a particular test and the nature of the sentences will
depend upon the group of persons and the purposes for which they are 
intended. One test may be devised to learn about satisfactions and annoyances,
likes and dislikes, fears and attractions. Another may be designed
to reveal motives, needs, and environmental forces. Still another
may be concerned primarily or solely with feelings about one's home,
community, friends, occupation, school. Others may be intended to discover
psychological mechanisms, such as feelings of rejection, evidence of
rationalization, methods of evasion. In other words, each sentence-completion
test should be adapted to the particular situation in which it
is to be applied.


Several sample items are the following: 


I worry over . . .

I feel proud when . . .

Other people usually . . .

I prefer to . . .

My father used to . . .

My hope is . . .

When I was a child . . .

Two fully described and frequently used instruments are Rohde's
Sentence Completion Test and The Rotter Incomplete Sentence Blank.
Each is designed to estimate the subject's degree and areas of maladjustment,
if any exist. Although both provide schemes for scoring responses, it
appears that this type of test is most useful for identifying areas of
behavioral problems and for providing diagnostic clues. For example, the
responses of a 10-year-old boy to one set of completion sentences were all
neutral and indicative of satisfactory adjustment except those related to
his school experiences.

On the whole, . . . completion tests evoke personality materials . . .
level of awareness than those . . . apperception type (17, 40). . . .
completions provide a basis for subsequent interviews and counseling, . . .

. . . jective questionnaire lies, rather, in the
fact that the answers are interpreted as revealing certain traits and serve
as a basis for psychological interview.

The following are items included in the questionnaire devised by the 
psychological staff of the Office of Strategic Services during World War II, 
to be used, among other tests, in the screening and job assignment of 
personnel (see Chapter 22). 


It seems that no matter how careful we are, we all sometimes have embarrassing
moments. What experiences make you feel like sinking through
the floor?

What kinds of things do you most dislike to see people do? 

What things or situations are you most afraid of? 

If you were (are) a parent, what things would you try to guard your chil- 
dren against most carefully? 


STORYTELLING AND STORY COMPLETION (11, 86). Although this method 
has been used informally for some years, the published work on it is 
meager. Several approaches have been employed, the most common with 
children being the following: 


Retelling of children's popular stories, such as “The Three Bears." 

Retelling the story the child likes best of all those ever heard or read. 

Stories made up on specified themes, such as on a boy or girl, a father, a 
mother. 

A story especially constructed for the purpose at hand, told to the children 
by a teacher; reproduced in writing for the teacher; later retold to the 
therapist, this being the "emotional version." In evaluating such retold 
stories, the therapist must take into account the common phenomena of 
memory distortions (6). 


Users of these methods report that they are helpful in revealing a child's 
conflicts, aggressions, anxieties, wish fulfillment, affectivity, and so forth. 
Among other methods that have been tried are these: 


A situation is set up creating a moral conflict, the effects of which on the 
child are evaluated through the story he makes up about it and his methods 
of dealing with the conflict. 

The child is asked to invent stories about favorite comic-strip characters; 
and these stories are regarded as indications of his "retreat into fantasy," 
on the one hand, or his realistic attitude toward his environment and inter- 
personal relations, on the other. The child is also asked to invent stories 
about disliked comic-strip characters as a means of discerning his personality
tensions and for subsequent use in therapy, since these stories give evidence 
of emotional transference and are a means of communicating affectivity. 


Storytelling as a projective method has been tried also with adults.


Ordinarily the subject is asked to develop a story on a theme dealing with 
some aspect, or aspects, of his environment in order to bring out some of 
his strivings and modes of adjustment. 

A little work has been done on combining music and storytelling. Recordings
of selected musical classics are played; the subject is told to report the
images and themes he associates with these. The expectation is that the 
respondent's reports will indicate certain psychological states, such as fear, 
struggle, romance, animation, reverence, and others. 


Story-completion techniques are of several kinds. 


Dramatic situations are presented briefly and concisely; the subject is re- 
quired to develop each into a skeleton of a short story. 

The outline of a story is provided; it is the subject's task to write a narra- 
tive based upon it. 

A brief incomplete story is presented, which the subject is to complete. 

The individual is given a group of brief statements each of which presents 
a situation of emotional conflict, followed by two questions on how the
central figure in the situation responded to it: "How did he feel?" "What
did he do and why?"


An elaboration and systematic development of story completion has
been offered by H. D. Sargent, called The Insight Test, for men and
women (86).


[It] . . . is composed of a series of items . . . in which the bare outlines
of a problem situation are stated and to which the subject is asked to respond
by telling what the leading character did, and why, and how he felt
about it. The nature of the material and the task itself, which is presented
as a test of insight or ability to "see into" the motives, actions, and feelings
of others is designed to conform to two basic principles of a projective test:
ambiguity of stimulus to which the subject must respond in his own personal
terms, and direction of attention away from concern with self toward
the task of reacting to something external.


This story completion test has the merit of providing a system of analysis
and scoring that yields interpretations based upon a set of common
principles.


[The scoring system] . . . involves primarily a differentiation between ex- 
pressions of feeling in response to the [items], and other types of expression 
which are regarded as serving the purpose of control, defense, and delay of 
unmodulated emotion. Phrases . . . are categorized both with reference to 
the fate of the aroused affect and to the mood or content expressed. Several 
indications of disturbance in thought or feeling are also recorded. Finally, 
certain relations between expressions supposed to represent affect and defense,
and within the quantities of different kinds of expression standing for
different sorts of discharge and control, are computed.


As a testing technique, storytelling and story completion present difficulties
in evaluation and interpretation; for, with the exception of
Sargent's test, the situations are almost completely unstructured and fluid,
and the possibilities of categorical analysis are almost unlimited. At
present these techniques must be regarded as useful principally for ex- 
ploratory purposes when viewed against a background of other informa- 
tion on the individual's experiences, traits, and behavior. Some clinicians 
who have used the Insight Test believe it has possibilities not yet demon- 
strated by research and practical applications. 


Drawing and Painting 


If volume of publication is a criterion, then it may be said that
drawing and painting (including finger painting) are of increasing interest
and significance as projective methods. Psychological studies have
dealt with the relations of art to chronological age, general intelligence,
specific aptitudes, and cultural influences. In these studies, the art productions
of normal and abnormal groups have been analyzed to discover
if there are characteristics that differentiate among normal persons and
various atypical groups. In other words, do the productions have diagnostic
value? It was thought, also, that drawings and paintings might
contribute information to descriptions of normal personalities. In studies
of psychotic groups, for example, it is reported that, among other things,
their productions are markedly similar to the drawings of children and of
some primitives; pronounced stereotypes . . . representation is extremely . . .


FIG. 26.1. Finger painting. “Sometimes it bleeds!
Look!” (Indicates paper painted red.) “Look!
Bloody! Like my throat!”

“He smears his hands and arms through the finger
paint. And as he pins his thoughts and feelings
down on paper he feels, perhaps, more secure. After
he has captured them on paper he can handle them
a little better.” From V. M. Axline, Play Therapy,
Boston: Houghton Mifflin Company, by permission.


. . . draw a person of the opposite sex. . . . followed by an inquiry, in
which the subject is asked to tell a story about each person he has drawn,
as though that . . . a reflection of self-regard. In drawing . . . it is held,
the subject becomes ego- . . . tensions are expressed through details and . . .



The nature of the analysis can be better appreciated by noting the 
structural features that are taken into account: size, background, exact- 
ness, degree of completion, detailing, symmetry, mid-line emphasis, per- 
spective, reinforcements, proportions, placement on the page, theme, 
shading, erasures. Content of the drawing is also analyzed: individual 
body parts, clothing, accessories, facial expression, posture. Esthetic qualities
are not considered.

Several body parts are listed below to illustrate interpretive significance 
attached to them: 

head: intellectual aspirations, rational control, fantasy elaboration of the
personality

eyes: uncertainty, paranoid wariness, sexual appeal

nose: masculinity and assertiveness

arms and hands: the primary extensive organs; indicators of degree of power,
degree of reaching out, extent of environmental manipulation

bilateral symmetry: degree of obsessiveness-compulsiveness

body points: exaggerated somatic preoccupation

For a number of years, Karen Machover has used and experimented 
with this device. She has made acute and insightful personality inter- 
pretations with it. Other clinical psychologists, also, report that they have 
made effective use of the technique. The validation of this projective
method, however, awaits more extensive study.


Other Noteworthy Projective Tests 


In a textbook, it is impossible to describe and discuss every
available test, since the number is so large, and many of them have not
progressed much beyond their initial stages. There are a few, however,
that are regarded by critics as having promise because they are psychologically
well conceived, they have proved useful in practical situations,
and results of research warrant further experimental and applied study.
Among them are the following.

Four-Picture Test (van Lennep) consists of four vaguely drawn colored
pictures. The subject is required either to write or to tell one integrated
story, using all four pictures. The pictures represent scenes of
social groups and of individuals in isolation. This test has undergone a
long period of development; careful analyses have been made of the
probable psychological significance of responses.

The Lowenfeld Mosaic Test uses 456 small wooden (or plastic) squares,
diamonds, and triangles of varied colors and sizes. The subject is instructed
to make anything he likes with the pieces. Patterns are classified
as representational, conceptual, or abstract in design. They are then
analyzed as multiform, composite, diffuse, and collective. Further complex
analyses are made regarding many details and approaches to the
problem for the purpose of differentiating among several age groups and
personality categories. This test's complex and laborious method of
analysis is a drawback to its wider use.


5 Buros' Fifth Mental Measurements Yearbook includes 51 projective tests that are
available in English-speaking countries.

Raven's Controlled Projection for Children, of ages 6 to 12, combines
drawing and storytelling. “The method consists of asking the child to
draw [anything he wants to] . . . describe a series of events . . .
The verbal record shows the content of the child's thought and the organization
of his ideas at the time of the test, while the drawing shows . . .
Raven's purpose is ". . . to . . . social, and clinical psychology.

The Tomkins-Horn Picture Arrangement Test consists of 25 plates, . . .
The subject is asked to indicate the . . . should be arranged to make "the
. . . clinical test, . . .


The House-Tree-Person Projective Technique (14), for ages 5 and 
older, requires the subject to draw a house, a tree, and a person, in that
order. While the drawings are being made, the examiner takes notes on 
sequence of detail, tempo, spontaneous comments, and general behavior. 
In a second session, an extension initiated by J. T. Payne, the subject is 
asked to use colored crayons in making drawings of a house, tree, and per- 
son. A planned interview, including a set of standardized questions, fol- 
lows the completion of the drawings. The purpose of the interview is to 
provide insights into various aspects of the drawings by having the subject 
describe, clarify, and interpret the objects. At the same time, the inter- 
view provides an opportunity for the examinee to free-associate. The 
methods of scoring and making qualitative interpretations of the draw- 
ings and responses are complex; but the objectives and personality traits 
being evaluated are similar to those of other projective tests: discern- 
ment of affective tone, quality of verbalizations, drive, psychosexual level, 
reactions to one's environment, interpersonal relations, intrapersonal 
balance, major needs, and major assets. The qualitative analysis is not 
based upon a single, theoretical system; it utilizes Freudian, neo-Freudian, 
and other concepts. Since the author of this instrument believes that each 
drawing arouses both conscious and unconscious associations, each of the 
objects presumably has symbolic and literal significance. The house re- 
lates to the subject's home and those living with him; the tree—highly 
symbolic—concerns his “life role" and his ability to "derive satisfaction" 
from his environment; the person represents his general and specific inter- 
personal relations. 

It is obvious that a large part of the information on these aspects of 
personality would be derived from interpretations of answers to the stand- 
ard questions in the interrogation. Clinicians who have had considerable 
experience with the H-T-P technique have found it useful especially with 
children; but its usefulness is largely a matter of individual and intuitive 
insights gained from varied experience with it. Other clinicians believe 
the technique is too time-consuming to administer and analyze and that 
equally or more valuable results may be obtained with other less complex 


instruments. 


Play 


Since play is free of the constraints of ordinary adult activity 
and free of those imposed by adults upon children, it is useful as a projec- 
tive technique in the study of less apparent aspects of personality. For it 
is unstructured, provides opportunity for fantasy and imaginativeness to 
operate, and gives scope to individuality of expression. 

As a method of personality diagnosis and therapy, play is used almost 
solely with children. It was first tried as a substitute for . . .
technique of psychoanalysis. From that practice, . . .
the theory that play activity is determined by a complex of . . .
provides an outlet for the release of emotional tensions and for overt
symbolic behavior expressing needs, wishes, desire for experience,
attitudes, without fear of censure or punishment.


FIG. 26.2. Doll play. She began to play with the family of dolls.
“This is Tom,” she said to the therapist. “Tom is a funny boy.
Want to know what happens to Tom?” From V. M. Axline, Play
Therapy, Boston: Houghton Mifflin Company, by permission.


. . . furniture (especially . . . ; water, sand, vehicles, animals, building
blocks, balloons, sticks, or any other objects that might be relevant in
a particular instance.


With play, as with drawing and painting, methods of analysis and interpretation
of behavior differ according to the theoretical position and
assumptions of the investigator or therapist. However, while the play technique
of personality study is a subjective method, numerous case reports
attest to its value as a diagnostic and therapeutic procedure.

An effort has been made to standardize and objectify the evaluation of 
children's play in the therapeutic situation (16). This device, called the
World-test, utilizes numerous miniature pieces that represent a large
variety of objects and persons in children's environments. The child is
instructed to create any setting or situation he chooses. His performance
and emotional condition are interpreted in terms of the number and
variety of pieces used, the structure of the situation (rigid, flexible or- 
ganization, chaotic), aggressiveness, etc. (depending upon which pieces 
are used and how they are used). Although this technique is more for- 
malized and has some objective aspects, the basic value of the method de- 
pends upon the interpretation of the child's performance and upon 
demonstrable relationships between this play activity and children's prob- 
lems of adjustment. 


Evaluation of Projective Tests 


Since the Rorschach and the Murray TAT are the two most 
widely used projective tests and have been subjected to the most thorough 
research, they were described and explained in some detail. The other 
projective tests were presented to acquaint the student more fully with 
psychological thinking in this area, to show how this thinking is being
implemented, and to indicate the variety of instruments. 

The following paragraphs summarize the present status and the major 
problems of current projective tests. 


The Rorschach and the TAT have proved to be of considerable value, 
even though much research remains to be done to fulfill all requirements 
of a soundly standardized test. It should be recalled, however, that many 
specialists maintain that the usual specifications of standardization cannot
and should not be applied to projective tests. 

Although progress is being made, adequate normative data are lacking for
projective tests, even for the Rorschach and the TAT. In some instances,
however, no normative data are available. Response norms, both quantitative
and qualitative, should be obtained for various groups, according to age,
socioeconomic status, sex, educational levels, and clinical classification.
More emphasis upon normal populations is needed to provide the proper
perspective within which to view performances of clinical groups. The views
and interpretations of clinicians have been too strongly affected by their
contacts with atypical personalities.

Reliability studies in terms of agreement among several scorers are most
significant for these instruments and on the whole have yielded encouraging
results. Test-retest reliabilities are of limited value and present special
difficulties, as already pointed out. The method of split-half reliability is
generally inappropriate for the reasons indicated.

Validity studies of the matching type have proved to be the most satisfactory
kind. The use of known groups has been only moderately satisfactory
because the criteria are qualitatively and, to some degree, differently defined. 
Furthermore, the use of these definitions in actual classification is a subjective 
matter. The relative unreliability of psychiatric classifications has already 
been noted. Validity studies are often complicated by the fact that inter- 
pretation of test data is “global” rather than in terms of limited and meas- 
urable elements. Laboratory studies of experimentally induced states or 
“sets” may be seriously questioned as a validating method; for they do not 
deal with basic and established personality traits. A possible exception is the 
use of hypnosis and similar means. 

Although instructions and procedures in administering a test do not vary
radically among specialists, uniformity is highly desirable.

An examiner’s attitude (for example, reassuring or neutral) and his relations,
in general, with the subject can influence the individual's responses.
If rapport is established with the examiner, the subject is encouraged to
verbalize and understand his behavior, attitudes, values, and the environmental
forces.

More nearly uniform and objective scoring systems should be developed.
Since interpretation must be based upon a theory and an understanding of
the dynamics of human behavior, different theories lead to varied interpretations.

More experimental research remains to be done on the nature of the psy- 
chological processes involved in some of the projective tests. 


Appropriately graded projective techniques are of more interest to subjects 
than are personality inventories. 


Malingering and falsification are more difficult than . . . inventories,
though not impossible.

Projective methods, particularly . . .

Projective methods alone are not the answer to all questions regarding
human personality and adjustment, a claim made for them by some
enthusiasts. But in the hands of a . . . through extended psychological . . .
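
The scorer-agreement form of reliability mentioned in the summary above can be illustrated with a simple percentage of agreement between scorers. The Python sketch below is a minimal illustration under stated assumptions: the function names, the scorer labels, and the category codes are hypothetical, every scorer is assumed to have classified the same responses in the same order, and plain percentage agreement is chosen here only for simplicity; the studies cited in the text may have used other indices.

```python
from itertools import combinations


def percent_agreement(scores_a, scores_b):
    """Percentage of responses that two scorers placed in the same category."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both scorers must rate the same non-empty set of responses")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


def mean_pairwise_agreement(scorers):
    """Average percent agreement over every pair of scorers.

    scorers maps a scorer's name to the list of categories that scorer
    assigned, one category per response.
    """
    pairs = list(combinations(scorers.values(), 2))
    if not pairs:
        raise ValueError("at least two scorers are required")
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical classifications of four responses by three scorers.
    ratings = {
        "scorer_1": ["extrapunitive", "intropunitive", "impunitive", "extrapunitive"],
        "scorer_2": ["extrapunitive", "intropunitive", "extrapunitive", "extrapunitive"],
        "scorer_3": ["extrapunitive", "impunitive", "impunitive", "extrapunitive"],
    }
    print(round(mean_pairwise_agreement(ratings), 1))
```

Percentage agreement makes no allowance for agreement that would occur by chance, so a high figure is easier to obtain when only a few categories are in use.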


Interest in projective methods is widespread among psychologists, from 
the viewpoints of experimentalists, personality theorists, and clinicians 
dealing with individuals who present every gradation and variety of problem
(not necessarily maladjustment) from childhood through adulthood.
It is to be expected, therefore, that solutions of the problem of projective
methods will, with time, be more nearly achieved and the inadequacies
of the instruments reduced. 6


If the present trend continues, we may expect to have a number of 
separate projective tests—especially of the thematic picture type—de- 
signed for limited age levels, providing relationships and situations sig- 
nificant for that group. The large crop of new devices, among them being 
some that are promising and insightful, will be “shaken down"; and the 
“chaff” will be separated from the "wheat." 7